Abstract

In  model  based  recognition  the  problem  is  to  locate  an  instance  of  one  or  several  known 
objects  in  an  image.  The  problem  is  compounded  in  real  images  by  the  presence  of 
clutter  (features  not  arising  from  the  model),  occlusion  (absence  in  the  image  of 
features  belonging  to  the  model),  and  sensor  error  (displacement  of  features  from 
their  actual  location).  Since  the  locations  of  image  features  are  used  to  hypothesize 
the  object’s  pose  in  the  image,  these  errors  can  lead  to  “false  negatives”,  failures  to 
recognize  the  presence  of  an  object  in  the  image,  and  “false  positives”,  in  which  the 
algorithm  incorrectly  identifies  an  occurrence  of  the  object  when  in  fact  there  is  none. 
This  may  happen  if  a  set  of  features  not  arising  from  the  object  are  located  such  that 
together  they  “look  like”  the  object  being  sought.  The  probability  of  either  of  these 
events  occurring  is  affected  by  parameters  within  the  recognition  algorithm,  which 
are  almost  always  chosen  in  an  ad-hoc  fashion.  The  implications  of  the  parameter 
values  for  the  algorithm’s  likelihood  of  producing  false  negatives  and  positives  are 
usually  not  understood  explicitly. 

To address the problem, we have explicitly modelled the noise and clutter that occur
in the image. In a typical recognition algorithm, hypotheses about the position of the
object are tested against the evidence in the image, and an overall score is assigned
to each hypothesis. We use a statistical model to determine what score a correct
or incorrect hypothesis is likely to have. We then use standard binary hypothesis
testing techniques to decide between correct and incorrect hypotheses.
Using this approach we can compare algorithms and noise models, and automatically
choose values for internal system thresholds to minimize the probability of making
a mistake. Our analysis applies equally well to both the alignment method and
geometric hashing.

Experimental results for the army knife. Though $\sigma_0$ for this image
group was determined to be 1.8, we see that the predictions for a value
of $\sigma_0 = 2$ or 3 are much better. The columns indicate the $\sigma_0$ used
for the experiment, either $P_F$ or $P_D$, the total number of hypotheses
tested, the expected number of hypotheses to score above the threshold,
the actual number that scored above the threshold, and the error bar
for this value (we show one standard deviation $= \sqrt{t P_F (1 - P_F)}$, $t$ =
number of trials). The actual $P_F$ (or $P_D$) is shown in the next column,
and  the  last  column  shows  the  average  distance  of  all  the  hypotheses 
that  passed  the  threshold  from  E[1V//] . 

6.3 Experimental results for the fork model.

6.4 Experimental results when all the points in the image bases tested come
from the model. The first and second columns contain the model and
$\sigma_0$ tested. The next columns contain the specified $P_F$, total number of
hypotheses tested, and number of hypotheses that passed the threshold.
The next two columns contain the actual $P_F$ for this experiment
and the value for the same experiment in which the tested image bases
are not constrained to come from the model (this value was taken from
the previous group of experiments). Finally, the last column is the error
bar for the experiment, which we took to be one standard deviation
$= \sqrt{t P_F (1 - P_F)}$, $t$ = number of trials.

6.5 Experimental results when uniform clutter is assumed. The first and
second columns contain the model and $\sigma_0$ tested. The next columns
contain the specified $P_F$, total number of hypotheses tested, and number
of hypotheses that passed the threshold. The next two columns
contain the actual $P_F$ for this experiment and the value for the same
experiment in which the effective density is calculated per hypothesis,
and the threshold dynamically reset.

Chapter 1


Introduction 


1.1  Motivation 

For machines to interact intelligently in the real world, they
must be capable of perceiving and interpreting their environs. To do this, they must
be  equipped  with  powerful  sensing  tools  such  as  humans  have,  one  of  which  is  Vision 
—  the  ability  to  interpret  light  reflected  from  the  scene  to  the  eye.  Computer  Vision  is 
the  field  which  addresses  the  question  of  interpreting  the  light  reflected  from  the  scene 
as  recorded  by  a  camera.  Humans  are  far  more  proficient  at  this  visual  interpretation 
task  than  any  computer  vision  system  yet  built.  Lest  the  reader  think  that  this  is 
because  the  human  eye  may  somehow  perceive  more  information  from  the  scene  than 
a  camera  can,  we  note  that  a  human  can  also  interpret  camera  images  that  a  computer 
cannot  —  that  is,  a  human  outperforms  the  computer  at  visual  interpretation  tasks 
even  when  limited  to  the  same  visual  input. 

Model  based  recognition  is  a  branch  of  computer  vision  whose  goal  is  to  detect  the 
presence  and  position  in  the  scene  of  one  or  more  objects  that  the  computer  knows 
about  beforehand.  This  capability  is  necessary  for  many  tasks,  though  not  all.  For 
example,  if  the  task  is  to  navigate  from  one  place  to  another,  then  the  goal  of  the 
visual  interpretation  is  to  yield  the  positions  of  obstacles,  regardless  of  their  identity. 
If  the  task  is  to  follow  something,  then  the  goal  of  the  interpretation  is  to  detect 
motion.  However,  if  the  task  is  to  count  trucks  that  pass  through  an  intersection  at 
a  particular  time  of  day,  then  the  goal  of  the  interpretation  is  to  recognize  trucks  as 
opposed  to  any  other  vehicle. 

Model based recognition is generally broken down into the following conceptual modules
(Figure 1-1). There is a database of models, and each known model is represented
in the database by a set of features (for example, straight edges, corners, etc.). In
order to recognize any of the objects in a scene, an image of the scene is taken by
a camera, some sort of feature extraction is done on the image, and then the features
from the image are fed into a recognition algorithm along with model features
retrieved from the model database. The task of the recognition algorithm is to determine
the location of the object in the image, thereby solving for the object’s pose
(position of the object in the environment).

Figure 1-1: Stages in model based recognition.

A typical recognition algorithm contains a stage which searches through pose hypotheses
based on small sets of feature correspondences between the model and the
image. For each hypothesis, the model is projected into the image under the assumed pose.
Evidence from the image is collected in favor of this pose hypothesis, resulting
in an overall goodness score. If the score passes some threshold $\theta$, the hypothesis is accepted.
In some algorithms this pose, which is based on a small initial correspondence, may be
passed on to a refinement and verification stage. For the pose hypothesis generator, we
use the term “correct hypothesis” to denote a pose based on a correct correspondence
between model and image features.

If  we  knew  in  advance  that  testing  a  correct  hypothesis  for  a  particular  model  would 
result  in  an  overall  score  of  S,  then  recognizing  the  model  in  the  image  would  be 
particularly  simple  —  we  would  know  that  we  had  found  a  correct  hypothesis  when 
we found one that had a score of S. However, in real images there may be clutter,
occlusion, and sensor noise, each of which will affect the scores of correct and incorrect
hypotheses. Clutter is the term for features not arising from the model; such
features may be aligned in such a way as to contribute to a score for an incorrect
pose hypothesis. Occlusion is the absence in the image of features belonging to the
model. This tends to lower the score of a correct hypothesis. Lastly, sensor
noise is the displacement of observed image features from their true location.

If we know that part of the model is occluded in the scene and yet keep the
threshold for acceptance at S, we risk the algorithm’s failing to identify
a correct hypothesis, which may not score that high. Therefore, we may choose to
lower the threshold for acceptance to something slightly less than S. However, the
lower we set the threshold, the higher the probability that an incorrect hypothesis
will pass it. The goal is to use a threshold which maximizes the probability that the
algorithm will identify a correct hypothesis (called a true detection) while minimizing
the probability that it accepts an incorrect one (called a false alarm).
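
The trade-off can be stated compactly. As a sketch, suppose the score $s$ of a correct hypothesis has density $p(s \mid H)$, the score of an incorrect hypothesis has density $p(s \mid \bar{H})$, and a hypothesis is accepted when $s \ge \theta$. These symbols are introduced here only for illustration. The probabilities of true detection and false alarm are then the two tail masses above the threshold:

```latex
P_D(\theta) = \int_{\theta}^{\infty} p(s \mid H)\, ds ,
\qquad
P_F(\theta) = \int_{\theta}^{\infty} p(s \mid \bar{H})\, ds .
```

Both quantities are non-increasing in $\theta$, so lowering the threshold to tolerate occlusion raises the probability of true detection and the probability of false alarm together.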

In this thesis, we determine the implications of using any particular threshold on the
probability of true detection and false alarm for a particular recognition algorithm.
The method applies to pose hypotheses based on minimal correspondences between
model and image points (i.e., size 3 correspondences). We explicitly model the kinds
of noise that occur in real images, and analytically derive probability density functions
on the scores of correct and incorrect hypotheses. These distributions are then
used to construct receiver operating characteristic curves (a standard tool borrowed
from binary hypothesis testing theory) which indicate all possible (threshold,
probability of false positive, probability of true positive) triples for an appropriately
specified statistical ensemble. We have demonstrated that the method works well in
the domain of both simulated and actual images.
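
A minimal sketch of how such triples could be tabulated and used to pick a threshold. It assumes, purely for illustration, that the scores of correct and incorrect hypotheses are Gaussian with made-up means and standard deviations; the densities actually derived in this thesis are not reproduced here, and all function names are hypothetical.

```python
import math

def gaussian_tail(threshold, mean, std):
    """P(score >= threshold) when the score is Gaussian (an assumption of this sketch)."""
    return 0.5 * math.erfc((threshold - mean) / (std * math.sqrt(2.0)))

def roc_triples(mean_correct, std_correct, mean_incorrect, std_incorrect, thresholds):
    """Return one (threshold, P_F, P_D) triple per candidate threshold."""
    triples = []
    for theta in thresholds:
        p_false = gaussian_tail(theta, mean_incorrect, std_incorrect)
        p_true = gaussian_tail(theta, mean_correct, std_correct)
        triples.append((theta, p_false, p_true))
    return triples

def pick_threshold(triples, max_false_positive):
    """Lowest threshold whose P_F stays within the allowed rate.

    Because P_D never increases with the threshold, the lowest admissible
    threshold also gives the highest P_D among the admissible ones.
    """
    admissible = [t for t in triples if t[1] <= max_false_positive]
    return min(admissible, key=lambda t: t[0]) if admissible else None

# Hypothetical score statistics, chosen only for illustration.
thresholds = [i * 0.5 for i in range(31)]          # 0.0, 0.5, ..., 15.0
triples = roc_triples(10.0, 2.0, 4.0, 2.0, thresholds)
chosen = pick_threshold(triples, 0.01)
```

With these illustrative numbers, `chosen` is the triple at threshold 9.0. Tightening `max_false_positive` moves the chosen threshold up and the probability of true detection down.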


1.2  Object  Recognition  as  Information  Recovery 

To  approach  the  problem  in  another  way,  we  can  think  of  the  object  recognition 
problem  as  a  process  of  recovering  a  set  of  original  parameters  about  a  source.  In  this 
abstraction,  there  is  some  sort  of  information  exchange  between  the  source  and  the 
observer,  the  information  might  be  corrupted  in  some  fashion,  the  observer  receives 
some  subset  of  the  information  with  added  noise,  and  finally,  processes  the  observed 
information  in  one  or  several  stages  to  settle  upon  a  hypothesis  about  the  parameters 
of  interest. 

For example, in the case of message transmission, the parameter of interest is
the message itself, the noise is introduced by the channel, and the observer tries to
recover the original transmitted message. In sonar based distance measurement, the
parameter of interest is the free distance along a particular direction from a source,
the information is the reflected sonar beam, and the perceived information is the
time delay between sending and receiving the beam. The observer then processes this
information to derive a hypothesis about the free space along the particular direction.

Figure 1-2: Recovering information from a source over a noisy channel.

In  model  based  vision,  the  parameters  of  interest  are  the  presence  or  absence  of  a 
model  in  a  scene,  and  its  pose  in  three  dimensional  space.  The  information  is  the 
light  which  is  reflected  from  a  source  by  the  objects  in  the  scene,  and  the  perceived 
information  is  the  light  which  enters  the  lens  of  a  camera.  The  information  goes 
through  several  processing  stages  to  get  transformed  into  a  two  dimensional  array  of 
brightness  values  representing  an  image,  and  then  through  several  more  steps  to  come 
to  a  hypothesis  about  the  presence  and  pose  of  any  particular  model. 

In this thesis we cast the problem as a binary hypothesis testing problem. Let the
hypothesis H be “model is at pose P”. We are trying to reliably distinguish between
H and $\bar{H}$. It is not always possible to do this, especially as the noise goes
up, but we can bound the probability of error as a function of the statistics of the
problem, and can determine when the noise is too high to distinguish between the
two hypotheses.


1.3  Overview  of  the  Thesis 

All  of  the  definitions,  terminology,  conventions  and  formulas  that  we  will  use  in  the 
thesis  are  given  in  Appendix  A. 

Chapter  2  explains  the  model  based  recognition  problem  in  more  detail,  and  gives  a 
very  general  overview  of  work  relevant  to  this  thesis.  We  will  define  the  terms  and 
concepts  to  which  we  will  be  referring  in  the  rest  of  the  work. 

In  Chapter  3  we  present  the  detailed  error  analysis  of  the  problem. 

In  Chapter  4  we  present  the  ROC  (receiver  operating  characteristic)  curve,  borrowed 
from  hypothesis  testing  theory  and  recast  in  terms  of  the  framework  of  model  based 
recognition.  The  ROC  curve  compactly  encompasses  all  the  relevant  information  to 
predict  (threshold,  probability  of  false  positive,  probability  of  false  negative)  triples  for 
an appropriately specified statistical ensemble. We also confirm the accuracy of the
ROC curve’s performance predictions with actual experiments consisting of simulated
images and models.

Chapter  5  explores  the  effect  of  varying  some  of  the  assumptions  that  were  used  in 
Chapter 3. This chapter can be skipped without loss of continuity.

In the first part of Chapter 6 we measure the sensor noise associated with different
feature types and imaging conditions. In the second part, we demonstrate the application
of ROC curves to the problem of automatic threshold determination for real
models and images.

In  Chapter  7  we  discuss  implications  of  our  work  for  geometric  hashing,  a  recognition 
technique  closely  related  to  the  one  analysed  in  the  thesis. 

Finally,  we  conclude  in  Chapter  8  with  potential  applications  and  extensions. 

Chapter 2

Problem  Presentation  and 
Background 


We  begin  by  setting  the  context  for  our  problem.  First  we  will  define  the  terms  to 
which  we  will  be  referring  in  the  rest  of  the  thesis.  We  will  then  talk  about  different 
techniques  for  solving  the  recognition  problem,  and  finally  we  will  discuss  how  these 
techniques  are  affected  by  incorporating  an  explicit  error  model. 


2.1  Images,  Models  and  Features 

An  image  is  simply  a  two  dimensional  array  of  brightness  values,  formed  by  light 
reflecting  from  objects  in  the  scene  and  reaching  the  camera,  whose  position  we 
assume  is  fixed  (we  will  not  talk  about  the  details  of  the  imaging  process).  An  object 
in  the  scene  has  6  degrees  of  freedom  (3  translational  and  3  rotational)  with  respect 
to  the  fixed  camera  position.  This  six  dimensional  space  is  commonly  referred  to  as 
transformation space. The most brute force approach to finding an object in the scene
would  be  to  hypothesize  the  object  at  every  point  in  the  transformation  space,  project 
the  object  into  the  image  plane,  and  perform  pixel  by  pixel  correlation  between  the 
image  that  would  be  formed  by  the  hypothesis,  and  the  actual  image.  This  method  is 
needlessly  time  consuming  however,  since  the  data  provided  by  the  image  immediately 
eliminates  much  of  the  transformation  space  from  consideration. 

Using image features prunes down this vast search space to a more manageable size.
What is meant by the term “image feature” is: something detectable in the image
which could have been produced by a localizable physical aspect of the model (called
a model feature), regardless of the model’s pose. For example, an image feature might
be a brightness gradient, which might have been produced by any one of several model
features: a sudden change in depth, indicating an edge or a boundary on the object,
or a change in color or texture. An image feature can be simple, such as “something
at pixel (x,y)”, or arbitrarily complex, such as “a 45° straight edge starting at pixel
(x,y) of length 5 separating blue pixels from orange pixels”.

The utility of features lies in their ability to eliminate entire searches from consideration.
For example, if the model description consisted of only corner features bordering
a blue region, but there were no such corners detected in the image, this information
would obviate the need to search the image for that object. Another way to use
features is to form correspondences between image and model features. This constrains
the possible poses of the object, since not all points in transformation space
will result in the two features being aligned in the image. In fact, depending on the
complexity of the feature, sometimes they cannot be aligned at all; for instance,
there is no point in transformation space that will align a 45° corner with a curved
edge. The more complex the feature, the fewer correspondences are required to constrain
the pose completely. For instance, if the features consist of a 2D location plus
orientation, only two correspondences are required to solve for the pose of the object.
If the features are 2D points without orientation, then a correspondence between 3
image and model features (referred to as a size 3 correspondence) constrains the pose
completely.
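
To make the size 3 case concrete, the following sketch recovers a transformation from three point correspondences. It makes a simplifying assumption: the model is planar and the imaging model is scaled orthographic, so the pose acts on the model plane as a 2D affine map (the setting used for planar models in Section 2.3.3). Recovering a full 3D pose from three points is treated in [Hut88] and is not shown. The function names are hypothetical.

```python
def affine_from_three(model_pts, image_pts):
    """Solve for A (2x2, row-major tuple) and t such that A*m_k + t = i_k, k = 0, 1, 2."""
    (mx0, my0), (mx1, my1), (mx2, my2) = model_pts
    (ix0, iy0), (ix1, iy1), (ix2, iy2) = image_pts
    # Edge vectors of the model triangle (u, v) and image triangle (p, q).
    ux, uy = mx1 - mx0, my1 - my0
    vx, vy = mx2 - mx0, my2 - my0
    px, py = ix1 - ix0, iy1 - iy0
    qx, qy = ix2 - ix0, iy2 - iy0
    det = ux * vy - vx * uy
    if abs(det) < 1e-12:
        raise ValueError("model points are collinear; the transformation is not determined")
    # A = [p q] [u v]^(-1), written out term by term.
    a11 = (px * vy - qx * uy) / det
    a12 = (qx * ux - px * vx) / det
    a21 = (py * vy - qy * uy) / det
    a22 = (qy * ux - py * vx) / det
    # The translation follows from the first correspondence.
    tx = ix0 - (a11 * mx0 + a12 * my0)
    ty = iy0 - (a21 * mx0 + a22 * my0)
    return (a11, a12, a21, a22), (tx, ty)

def apply_affine(A, t, point):
    """Project a model-plane point into the image under the hypothesized transformation."""
    a11, a12, a21, a22 = A
    x, y = point
    return (a11 * x + a12 * y + t[0], a21 * x + a22 * y + t[1])

# Illustrative correspondence: the unit triangle stretched and shifted.
model = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
image = [(2.0, 3.0), (4.0, 3.0), (2.0, 6.0)]
A, t = affine_from_three(model, image)
projected = apply_affine(A, t, (1.0, 1.0))
```

Once `A` and `t` are known, every remaining model point can be projected into the image, which is what allows a hypothesis built from only three pairings to be tested against the rest of the data.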

It would seem intuitively that the richer the feature, the more discriminative power it
imparts, since one need not check correspondences that contain incompatible feature
pairings. In fact, there is an entire body of work devoted to using feature saliency
[SU88, Swa90] to efficiently perform object recognition. It is true that one can use
more complex features to prune the search space more drastically, but the more
complex the feature, the more likely there is to be an error in the feature pairing
process due to error and noise in the imaging and feature extraction processes. For
simplicity, in this work we consider only point features, meaning that image and model
features are completely characterized by their 2D and 3D locations, respectively.


2.2  Categorizing  Error 

We use the term “error” to describe any effect which causes an image of a model in a
known pose to deviate from what we expect. The kinds of errors which occur in the
recognition process can be grouped into three categories:

•  Occlusion  —  in  real  scenes,  some  of  the  features  we  expect  to  find  may  be 
blocked  by  other  objects  in  the  scene.  There  are  several  models  for  occlusion: 
the  simplest  is  to  model  it  as  an  independent  process,  i.e.,  we  can  say  that  we 
expect  some  percentage  of  features  to  be  blocked,  and  consider  every  feature  to 
have  the  same  probability  of  being  occluded  independent  of  any  other  feature. 
Or, we can use a view based method which takes into account which features are
self-occluded due to pose. More recently, Breuel [Bre93] has presented a new
model that uses locality of features to determine the likelihood of occlusion;
that is, if one feature is occluded under a specific pose hypothesis, an adjacent
feature is more likely to be occluded.

•  Clutter  or  Noise  —  these  are  extraneous  features  present  in  the  image  not  arising 
from the object of interest, or arising from unmodelled processes (for example,
highlights). Generally these are modelled as points that are independently and
uniformly distributed over the image. These will be referred to as clutter or,
sometimes, “random image points”.

•  Sensor  Measurement  Error  —  image  features  formed  by  objects  in  the  scene 
may  be  displaced  from  their  true  locations  by  many  causes,  among  them:  lens 
distortion,  illumination  variation,  quantization  error,  or  algorithmic  processing 
(for  instance,  a  brightness  gradient  may  be  slightly  moved  due  to  the  size  of  the 
smoothing mask used in edge detection, or the location of a point feature may be
shifted due to artifacts of the feature extraction process). This may be referred
to as simply “error”.
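
The three categories can be sketched as a small simulation that corrupts the ideal projected model points. The occlusion and clutter models follow the simplest ones named above (independent occlusion, uniform clutter). The Gaussian displacement used for sensor error, the parameter names, and the function name are assumptions of this sketch.

```python
import random

def simulate_image(projected_pts, p_occlusion, sigma, n_clutter, width, height, rng):
    """Corrupt ideal projected model points with occlusion, sensor error, and clutter.

    Returns the list of observed image points and the number of model points
    that survived occlusion (these come first in the list).
    """
    image_pts = []
    for (x, y) in projected_pts:
        # Occlusion: every feature is dropped independently with the same probability.
        if rng.random() < p_occlusion:
            continue
        # Sensor error: displace the surviving feature (Gaussian assumed here).
        image_pts.append((x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)))
    n_visible = len(image_pts)
    # Clutter: extra points placed independently and uniformly over the image.
    for _ in range(n_clutter):
        image_pts.append((rng.uniform(0.0, width), rng.uniform(0.0, height)))
    return image_pts, n_visible

# Illustrative values only.
rng = random.Random(0)
ideal = [(10.0 * k, 20.0) for k in range(20)]
observed, n_visible = simulate_image(ideal, 0.25, 1.5, 30, 256.0, 256.0, rng)
```

A generator of this kind is one way to produce synthetic images with known ground truth for checking predicted false positive and false negative rates.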

The interest in error models for vision is a fairly recent phenomenon which has been
motivated by the fact that for any recognition algorithm, these errors almost always
lead to finding an instance of the object where it doesn’t appear (called a false positive),
or missing an actual appearance of an object (a false negative).

We will present an overview of recognition algorithms, first assuming that none of
these effects are present, and subsequently we will discuss the implications of incorporating
explicit models for these processes.


2.3  Search  Methods 

Much of the work done in model based recognition uses pairings between model
and image features, and can be loosely grouped into two categories: correspondence
space search and transformation space search. We will treat another method, indexing,
as a separate category, though it could be argued that it falls within the realm of
transformation space search. The error analysis presented in this work applies to
those approaches falling in a particular formulation of the transformation space search
category. In this section we discuss the general methods, assuming no explicit error
modeling.


2.3.1  Correspondence  Space 

In this approach, the problem is formulated as finding the largest mutually consistent
subset of all possible pairings between model and image features, a set whose size is
on the order of $m^n$ (in which m is the number of model features, and n is the number
of image features). Finding this subset has been formalized as a consistent graph
labelling problem [Bha84], or, by connecting pairs of mutually consistent correspondences
with edges, as a maximal clique problem [BC82], and as a tree search problem
in [GLP84, GLP87]. The running time of all of these methods is at worst exponential;
however, at least for the last approach Grimson has shown that with pruning, fruitless
branches of the tree can be abandoned early on, so that this particular method’s
expected running time is polynomial [Gri90].


2.3.2  Transformation  Space


In the transformation space approach, all size G correspondences are tested, where
G is the size of the smallest correspondence required to uniquely solve for the transformation
needed to bring the G image and model features into correspondence. The
transformation thus found is then used to project the rest of the model into the image
to search for other corresponding features. The size of the search space is polynomial,
$O(m^G n^G)$ to be precise. This overall method has come to be associated with Huttenlocher
and Ullman ([HU87]), who dubbed it “alignment”, though other previous
work  used  transformation  space  search  (for  example,  the  Hough  transform  method 
[Bal81]  as  well  as  [Bai84,  FB80,  TM87],  and  others).  One  of  the  contributions  of 
Huttenlocher’s work was to show that a feature pairing of size 3 was necessary and
sufficient to solve uniquely for the model pose, and how to do it. Another characteristic
of the alignment method as presented in [Hut88] was to use a small number
of simple features to form an initial rough pose hypothesis, and to iteratively add
features to stabilize and refine the pose. Finally, for the pose to be accepted it must
pass a final test in which more complex model and image features must correspond
reasonably well; for example, some percentage of the model contour must line up
with edges in the image under this pose hypothesis. This last stage is referred to
as “verification”. Since it is computationally more expensive than generating pose
hypotheses, it is more efficient to only verify pose hypotheses that have a reasonable
chance of success.
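
The test applied to a single pose hypothesis can be sketched as follows. The scoring rule used here, counting the model points whose projection falls within a fixed radius of some image point, is one simple choice made for illustration and is not claimed to be the rule analysed later. `transform` stands for whatever pose the hypothesis generator produced, and all names are hypothetical.

```python
import math

def score_hypothesis(transform, model_pts, image_pts, radius):
    """Count model points whose projection lies within `radius` of some image point."""
    score = 0
    for m in model_pts:
        px, py = transform(m)
        if any(math.hypot(px - ix, py - iy) <= radius for (ix, iy) in image_pts):
            score += 1
    return score

def accept(score, threshold):
    """A hypothesis is passed on (for example, to verification) if its score reaches the threshold."""
    return score >= threshold

# Illustrative hypothesis: the model is translated by (5, 5).
def shift(p):
    return (p[0] + 5.0, p[1] + 5.0)

model = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
image = [(5.0, 5.0), (6.0, 5.1), (20.0, 20.0)]
score = score_hypothesis(shift, model, image, 0.5)
```

In this example two of the four projected model points find a nearby image point, so the hypothesis passes a threshold of 2 and fails a threshold of 3. Only hypotheses that pass would go on to the more expensive verification stage.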


2.3.3  Indexing  Methods 

Lastly, we come to indexing methods. Here, instead of checking all poses implied
by all size 3 pairings between model and image features, the search space is further
reduced by using image feature groups larger than the minimum of 3 and pairing
them only with groups in the model that could have formed them. This requires a
way to access only such model groups without checking all of them. To do this, the
recognition process is split into two stages. The first is a model preprocessing stage in which, for
each group of size G, some distinguishing property of all possible images of that group
is computed and used to store the group in a table, indexed by that property. This
preprocessing stage takes time $O(m^G)$ (where m is the number of model points). At
recognition time, each size G image group is used to index into the table to find the
model groups that could have formed it, for a running time of $O(n^G)$ (where n is the
number of image points).

At one extreme, we could use an index space of dimension 2G (assuming the features
are two dimensional) and simply store the model at all positions $(x_1, y_1, \ldots, x_G, y_G)$ for
every pose of the model. However, this saves us nothing, since the space requirements
for the lookup table would be enormous and the preprocessing stage at least as time
consuming as a straight transformation space approach. The trick is to find the
lowest dimensional space which will compactly represent all views of a model without
sacrificing discriminating power.

Lamdan, Schwartz, Wolfson and Hummel [LSW87, HW88] demonstrate a method,
called geometric hashing, to do this in the special case of planar models. Their
algorithm takes advantage of two things. First, for a group of 3 non-collinear points in
the plane, the affine coordinates of any fourth point with respect to the first three as
basis are invariant to an affine transformation of the entire model plane. That is, any
fourth point can be written in terms of the first three:


$$\mathbf{m}_3 = \mathbf{m}_0 + \alpha(\mathbf{m}_1 - \mathbf{m}_0) + \beta(\mathbf{m}_2 - \mathbf{m}_0).$$


We can think of $(\alpha, \beta)$ as the affine coordinates of $\mathbf{m}_3$ in the coordinate system
established by mapping $\mathbf{m}_0, \mathbf{m}_1, \mathbf{m}_2$ to (0,0), (1,0), (0,1). These affine coordinates
are invariant to a linear transformation T of the model plane.
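
The invariance is easy to check numerically. The sketch below computes the affine coordinates of a fourth point by solving the 2x2 system with Cramer's rule, then repeats the computation after an arbitrary affine map of the plane. The map and the point values are illustrative, and the function name is hypothetical.

```python
def affine_coords(basis, point):
    """Return (alpha, beta) with point = b0 + alpha*(b1 - b0) + beta*(b2 - b0)."""
    (x0, y0), (x1, y1), (x2, y2) = basis
    ux, uy = x1 - x0, y1 - y0          # b1 - b0
    vx, vy = x2 - x0, y2 - y0          # b2 - b0
    wx, wy = point[0] - x0, point[1] - y0
    det = ux * vy - vx * uy
    if abs(det) < 1e-12:
        raise ValueError("basis points are collinear")
    # Cramer's rule for alpha*u + beta*v = w.
    alpha = (wx * vy - vx * wy) / det
    beta = (ux * wy - wx * uy) / det
    return alpha, beta

def example_map(p):
    """An arbitrary invertible affine map of the plane, used only for the check."""
    x, y = p
    return (2 * x + y + 3, -x + 3 * y - 1)

basis = [(0, 0), (2, 0), (0, 2)]
fourth = (1, 3)
before = affine_coords(basis, fourth)
after = affine_coords([example_map(b) for b in basis], example_map(fourth))
```

Both `before` and `after` come out as (0.5, 1.5): the coordinates of the fourth point relative to the basis are unchanged by the map, which is the property the hash table index relies on.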

Second, there is a one-to-one relationship between an image of a planar model in a 3D
pose and an affine transformation of the model plane. We assume that the pose has
3 rotational and 2 translational degrees of freedom, and we use orthographic projection
with scale as the imaging model. Then the 3D pose and subsequent projection
collapse down to two dimensions:

$$
\begin{pmatrix} x' \\ y' \end{pmatrix}
= s \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}
R \begin{pmatrix} x \\ y \\ 0 \end{pmatrix}
+ \begin{pmatrix} t_x \\ t_y \end{pmatrix},
$$

where $(x, y)$ is a point in the model plane, $(x', y')$ is its image, $(t_x, t_y)$ is the
translation, s is the scale factor, and the matrices are the orthographic projection matrix
and rotation matrix $R$, respectively. Conversely, a three point correspondence between
model and image features uniquely determines both an affine transformation of the
model plane, and also a unique scale and pose for an object (up to a reflection over
a plane parallel to the image plane; see [Hut88]).

Therefore,  suppose  we  want  to  locate  an  ordered  group  of  four  model  points  in  an 
image (where the model’s 3D pose is unknown). The use of the affine coordinates of
the  fourth  point  with  respect  to  the  first  three  as  basis  to  describe  this  model  group 
is  pose  invariant,  since  no  matter  what  pose  the  model  has,  if  we  come  across  the 
four  image  points  formed  by  this  model  group,  finding  the  coordinates  of  the  fourth 
image  point  with  respect  to  the  first  three  yields  the  same  affine  coordinates. 

Geometric hashing involves doing this for all model groups of size 4 at the same time.
The algorithm requires the following preprocessing stage: for each model group of
size 4, the affine coordinates of the fourth point are used as a pose invariant index
into the table to store the first three points. This stage takes $O(m^4)$ time, where $m$ is the
number of model points. At recognition time, each size 3 image group is tested in
the following way: for a fixed image basis B, (a) for every image point, the affine
coordinates are found with respect to B, then (b) the affine coordinates are used to
index into the hash table. All model bases stored at the location indexed by the affine
coordinates are candidate matches for the image basis B. A score is incremented for
each candidate model basis and the process is repeated for each image point. After
all image points have been checked, the model basis that accumulated the highest
score and passes some threshold is taken as a correct match for the image basis B.
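
The two stages just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of any cited system: the function names (`affine_coords`, `quantize`, `build_table`, `vote`), the bin width `cell`, and the use of a dictionary as the hash table are all hypothetical choices, and sensor error is ignored apart from the coarse quantization.

```python
from collections import defaultdict
from itertools import permutations


def affine_coords(p, b0, b1, b2):
    """Affine coordinates (alpha, beta) of p w.r.t. the basis (b0, b1, b2).

    Solves p - b0 = alpha*(b1 - b0) + beta*(b2 - b0) by Cramer's rule.
    Returns None when the basis points are (nearly) collinear.
    """
    ax, ay = b1[0] - b0[0], b1[1] - b0[1]
    bx, by = b2[0] - b0[0], b2[1] - b0[1]
    px, py = p[0] - b0[0], p[1] - b0[1]
    det = ax * by - ay * bx
    if abs(det) < 1e-12:
        return None
    return (px * by - py * bx) / det, (ax * py - ay * px) / det


def quantize(coords, cell):
    """Map affine coordinates to a hash-table bin (assumed bin width `cell`)."""
    return (round(coords[0] / cell), round(coords[1] / cell))


def build_table(model, cell=0.25):
    """Preprocessing: index every ordered model basis by each fourth point."""
    table = defaultdict(list)
    for basis in permutations(range(len(model)), 3):
        b0, b1, b2 = (model[i] for i in basis)
        for idx, p in enumerate(model):
            if idx in basis:
                continue
            coords = affine_coords(p, b0, b1, b2)
            if coords is not None:
                table[quantize(coords, cell)].append(basis)
    return table


def vote(table, image, image_basis, cell=0.25):
    """Recognition: tally votes for model bases given one image basis."""
    b0, b1, b2 = (image[i] for i in image_basis)
    votes = defaultdict(int)
    for idx, p in enumerate(image):
        if idx in image_basis:
            continue
        coords = affine_coords(p, b0, b1, b2)
        if coords is None:
            continue
        for model_basis in table.get(quantize(coords, cell), []):
            votes[model_basis] += 1
    return votes
```

In use, one would call `build_table` once per model, call `vote` for each candidate image basis, and compare the highest tally against a threshold. The nested loops over ordered bases and remaining points mirror the preprocessing cost stated above.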

In theory, the technique takes time $O(m^4 + n^4)$ (with $n$ the number of image points). More recently, Clemens and Jacobs
formalized the indexing problem in [CJ91] and showed that 4 dimensions is the minimum
required to represent 3D models in arbitrary poses. All views of a group of size
4 form a 2D manifold in this space, implying that unlike in the planar model domain,
there exists no pose invariant property for 3D models in arbitrary poses. Other work
involving geometric hashing can be found in [CHS90, RH91, Ols93, Tsa93].


2.4  The  Effect  of  Error  on  Recognition  Methods 

All recognition algorithms test pose hypotheses by checking for a good match between
the image that would be formed by projecting the model using the tested pose
hypothesis, and the actual image. We will discuss exactly what we mean by a “good
match” shortly. The three kinds of errors cause qualitatively different problems for
recognition algorithms. Occlusion brings down the amount of evidence in
favor of correct hypotheses, risking false negatives. The presence of clutter introduces
the possibility that a clutter feature will arise randomly in a position such that it is
counted as evidence in favor of an incorrect pose hypothesis, risking false positives.
Sensor error has the effect of displacing points from their expected locations, such
that a simple test of checking for a feature at a point location in the image turns into
a search over a small disk, again risking false positives.

It would appear that simply in terms of running time, the search techniques
(correspondence space search → transformation space search → indexing) go in order
of worst to best. However, this ranking becomes less clear once the techniques are
modified  to  take  error  into  account.  The  differences  between  the  approaches  then 
become  somewhat  artificial  in  their  implementations,  since  extra  steps  must  often  be 
added  which  blur  their  conceptual  distinctions. 

Correspondence  space  search  is  the  most  insensitive  to  error,  since  given  the  correct 
model-feature  pairings,  the  globally  best  pose  can  be  found  by  minimizing  the  sum 
of  the  model  to  image  feature  displacements. 

For transformation space approaches, dealing with error turns the problem into a
potentially exponential one. The reason is that the transformation space approach
checks only those points in the space that are indicated by size 3 correspondences
between model and image features. Though there are many correct model to image feature
correspondences, there is only one globally correct pose. The poses implied by all
correct correspondences will be clustered near the globally correct pose in transformation
space, but it is likely that none of them will actually land on it. Therefore,
finding the globally best pose will require iteratively adding model-feature pairings
to the initial correspondence and minimizing the total error. However, for each additional
pairing, the model point in the pair can match to any image point which
appears in a finite sized disk in the image. Assuming uniform clutter, some fraction
k of all the image points will appear in such a disk. If all of them have to be checked
as  candidate  matches,  this  brings  the  search  to  size 

To conclude the discussion of the effect of noise on different techniques, we note that
in general, the more efficient an algorithm, the more unstable it is in the presence
of noise. This observation is not really surprising since the speed/reliability trade-off
is as natural and ubiquitous in all computer science as the speed/space trade-off. In
the remaining discussion and throughout the thesis, we will be dealing solely with
transformation space search, and the analysis that we present is applicable equally
well to both alignment and geometric hashing.


2.5  Error  Models  in  Vision 

Work incorporating explicit error models for vision has used either a uniform
bounded error model or a 2D Gaussian error model. A uniform bounded error model
is one in which the difference between the sensed and actual location of a projected
model point can be modeled as a vector drawn from a bounded disk with uniform, or
flat, distribution. A Gaussian error model is one in which the sensed error vector is
modeled with a two dimensional Gaussian distribution. Clutter and occlusion, when
modeled, are treated as uniformly distributed and independent. Though there has
not been a great deal of this type of work, there are some notable examples.


2.5.1  Uniform  Bounded  Error  Models 

Recently,  Cass  showed  that  finding  the  best  pose  in  transformation  space,  assuming 
a  uniform  bounded  error  associated  with  each  feature,  can  be  reduced  to  the  problem 
of  finding  the  maximal  intersection  of  spiral  cylinders  in  transformation  space.  Stated 
this  way,  the  optimal  pose  can  be  found  in  polynomial  time  (O(m^n^))  by  sampling 
only  the  points  at  which  pairs  of  these  spiral  cylinders  intersect  [Cas90].  Baird  [Bai84] 
showed  how  to  solve  a  similar  problem  for  polygonal  error  bounds  in  polynomial  time 
by  formulating  it  in  terms  of  finding  the  solution  to  a  system  of  linear  equations. 

Grimson, Huttenlocher and Jacobs [GHJ91] did a detailed comparative error analysis
of both the alignment and geometric hashing methods of [LSW87, HW88]. They used
a uniform bounded error model in the analysis and concentrated on determining the
probability of false positives for each technique. Also, Jacobs demonstrates an indexing
system for 3D models in [Jac92] which explicitly incorporates uniform bounded
error.

2.5.2  Gaussian  Error  Models 

The previous work all used a uniform bounded error model to analyze the effect of
error on the recognition problem. This model is in some ways simpler to analyze, but
in general it is too conservative a model in that it overestimates the effect of error.
A Gaussian error model will often give analytically better results and so it is often
assumed even when the actual distribution of error has not been extensively tested. It
can be argued, however, that the underlying causes of error will contribute to a more
Gaussian distribution of features, simply by citing the Central Limit Theorem. In
[Wel92], Wells presented experimental evidence indicating that for a TV sensor
and a particular feature class, a Gaussian error model is in fact a more accurate
noise model than the uniform. Even when the Gaussian model is assumed, there is
often not a good idea of the standard deviation, and generally an arbitrary standard
deviation is picked empirically.

Wells also solved the problem of finding the globally best pose and feature correspondence
with Gaussian error by constructing an objective function over pose and
correspondence space whose argmin was the best pose hypothesis in a Bayesian sense.
To find this point in the space he used an expectation-maximization algorithm which
converged quite quickly, in 10-40 iterations, though the technique was not guaranteed
to converge to the likelihood maximum.

Rigoutsos and Hummel [RH91] and Costa, Haralick and Shapiro [CHS90] independently
formulated a method to do geometric hashing with Gaussian error, and demonstrated
results more encouraging than those predicted in Grimson, Huttenlocher and
Jacobs' analysis of the uniform bounded model. Tsai also demonstrates an error
analysis for geometric hashing using line invariants in [Tsa93].

Bolles, Quam, Fischler, and Wolf demonstrate an error analysis in the domain of
recognizing terrain models from aerial photographs ([BQFW78]). In their work, a
Gaussian error model was used to model the uncertainty in the camera parameters and
camera to scene geometry, and it was shown that under a particular hypothesis
(which in this domain is the camera to scene geometry) the regions consistent with
the projected model point locations (features in the terrain model) are ellipses in the
image.


2.5.3  Bayesian  Methods 

The Gaussian error model work has used a Bayesian approach to pose estimation,
i.e., it assumes a prior probability distribution on the poses and uses the rule

$$P(\text{pose} \mid \text{data}) = \frac{P(\text{data} \mid \text{pose})\, P(\text{pose})}{P(\text{data})}$$

to infer the most likely pose given the data. The noise model is used to determine
the conditional probability of the data given the pose. In Bayesian techniques, the
denominator in this expression is assumed to be uniform over all possible poses, and
so can be disregarded ([Wel92, RH91, CHS90, Tsa93]). This assumes that one of the
poses actually is correct, that is, that the object actually appears in the image. The
pose which maximizes this expression is the globally optimal pose. However, if we do
not know whether the model appears in the image at all, we cannot use the above
criterion.


2.6  The  Need  for  Decision  Procedures 

In  general,  it  is  possible  to  find  the  globally  best  pose  with  respect  to  some  criterion, 
but  if  we  have  no  information  as  to  whether  any  of  the  possible  poses  are  correct, 
that  is,  if  we  have  no  information  as  to  the  probability  that  the  model  appears  in  the 
image,  then  we  must  determine  at  what  point  even  the  most  likely  pose  is  compelling 
enough  to  accept  it. 

In this thesis we address this problem with respect to poses based on size 3 correspondences
between image and model features. We will use the term “correct” hypotheses
to denote correct size 3 correspondences. Such correct correspondences indicate points
in transformation space that are close to the correct pose for the model in the image.
Since transformation space search samples only those points in transformation space
that are implied by size 3 correspondences, what we are doing is trying to determine
when we have found a point in the space close enough to the correct pose to accept
it or to pass it on to a more costly verification stage.

Suppose we were working with a model of size $m$ in a domain with no occlusion,
clutter, or error. In this case, a correct hypothesis would always have all corroborating
evidence present. Therefore, to test if a hypothesis is correct or not, one
would project the model into the image subject to the pose hypothesis implied by the
correspondence, and test if there were $m$ image points present where expected. We
call this test a decision procedure and $m$ the threshold. However, suppose we admit
the possibility of occlusion and clutter, modeled as stated. Now it is not clear how
many points we need to indicate a correct hypothesis, since the number of points in
the image that will arise from the model is not constant. In particular, if any given
point is occluded with probability $c$, then the number of points we
will see for a correct hypothesis will be a random variable with binomial distribution.
Deciding if a hypothesis is correct is a question of determining if the amount of evidence
exceeds a reasonable threshold. So even without sensor error, we must have a
decision procedure and with it, an associated probability of making a mistake.
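
To make the binomial argument concrete, the following Python sketch computes the probability of a false negative for a given threshold in the setting just described (independent occlusion with probability `c`, no clutter, no sensor error). The function names and the parameter values in the usage note are hypothetical, chosen only for illustration.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def false_negative_rate(m, c, threshold):
    """Probability that a correct hypothesis shows fewer than `threshold`
    of its m model points, when each point is occluded independently
    with probability c (assumes no clutter and no sensor error)."""
    return 1.0 - prob_at_least(threshold, m, 1.0 - c)
```

For example, `false_negative_rate(20, 0.3, 12)` gives the chance of rejecting a correct hypothesis for a 20-point model with 30% occlusion when at least 12 points are required. Lowering the threshold reduces this probability, at the cost (once clutter is admitted) of accepting more incorrect hypotheses.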

When we also consider sensor error, the uncertainty in the sensed location of the 3
image points used in the correspondence to solve for the pose hypothesis magnifies
the positional uncertainty of the remaining model points (Figure 2-1). Therefore,
since a model point could fall anywhere in this region, we have to count any feature
which appears there as evidence in favor of the pose hypothesis. As the regions
spread out spatially, there is a higher probability that a clutter feature will appear
in such a region, even though it does not arise from the model. So now, instead of
never finding any evidence corroborating an incorrect pose hypothesis (assuming only
asymmetric models), the amount of evidence we find will also be a random variable
with distribution dependent on the error model.

Figure 2-1: Possible positions of a model point due to positional uncertainty in the three
points used in the correspondence to form the pose hypothesis.

It is important to understand the implications of using any particular threshold as
a decision procedure, since when the distributions of the two random variables overlap,
using a threshold will necessarily imply missing some good pose hypotheses and
accepting some bad ones. Most working vision systems operate under conditions in
which the random variables describing good and bad hypotheses are so widely separated
that it is easy to tell the difference between them. Few try to determine how
their system’s performance degrades as the distributions approach each other until
they are so close that it is not possible to distinguish between them.

It is this area that is addressed in this thesis. Our approach focuses not on the pose
estimation problem, but rather on the decision problem, that is, given a particular
pose hypothesis, what is the probability of making a mistake by either accepting or
rejecting it? This question has seldom been dealt with, though one notable exception
is the “Random Sample Consensus” (RANSAC) paradigm by Fischler and Bolles
([FB80]), in which measurement error, clutter and occlusion were modeled much
as in our work, and the question of choosing thresholds in order to avoid false positives
was addressed as well. More recently, error analyses concentrating on the probability of
false positives were presented in the domain of Hough transforms by [GH90], and in
geometric hashing by [GHJ91], and much of the approach developed in this thesis
owes a debt to that work.

Conclusion


We have structured this problem in a way which can be applied to those algorithms
which sample transformation space at those points implied by correspondences between
3 model and image features. In the next few chapters we will present the
method, and its predictive power for both simulated and real images.

Chapter 3

Presentation  of  the  Method 


In this chapter the problem we address is: given a model of an object and an image,
how do we evaluate hypotheses about where the model appears in the image?

The  basic  recognition  algorithm  that  we  are  assuming  is  a  simple  transformation  space 
search  equivalent  to  alignment,  in  which  pose  hypotheses  are  based  on  initial  minimal 
correspondences  between  model  and  image  points.  The  aim  of  the  search  is  to  identify 
correct  correspondences  between  model  and  image  points.  We  will  refer  to  correct 
and  incorrect  correspondences  as  “correct  hypotheses”  and  “incorrect  hypotheses”. 
Correct  hypotheses  specify  points  in  transformation  space  that  are  close  to  the  correct 
pose,  and  can  be  used  as  starting  points  for  subsequent  refinement  and  verification 
stages.  The  inner  loop  of  the  algorithm  consists  of  testing  the  hypothesis  for  possible 
acceptance. The steps are:

(1) For a given 3 model points and 3 image points,

(2)  Find  the  transformation  for  the  model  which  aligns  this  triple  of  model  points 
to  the  image  points, 

(3)  Project  the  remaining  model  points  into  the  image  according  to  this  transfor¬ 
mation, 

(4)  Look  for  possible  matching  image  points  for  each  projected  model  point,  and 
tally  up  a  score  depending  on  the  amount  of  evidence  found. 

(5) If the score exceeds some threshold $\theta$, then we say the hypothesis is correct.
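
The steps above can be sketched in Python as follows. This is only an illustrative sketch under several assumptions that are not part of the text: the names `affine_coords`, `project`, and `evaluate_hypothesis` are hypothetical; the projection uses affine coordinates with respect to the basis instead of solving explicitly for the transformation; the score is a simple count of projected model points that have some image point within a fixed `radius`; and a score that meets the threshold is accepted. The weighting scheme and threshold are exactly the parts the chapter goes on to develop, so the simple count here is a stand-in.

```python
from math import hypot


def affine_coords(p, b0, b1, b2):
    """(alpha, beta) with p - b0 = alpha*(b1 - b0) + beta*(b2 - b0)."""
    ax, ay = b1[0] - b0[0], b1[1] - b0[1]
    bx, by = b2[0] - b0[0], b2[1] - b0[1]
    px, py = p[0] - b0[0], p[1] - b0[1]
    det = ax * by - ay * bx
    if abs(det) < 1e-12:
        raise ValueError("basis points are collinear")
    return (px * by - py * bx) / det, (ax * py - ay * px) / det


def project(coords, s0, s1, s2):
    """Image location implied by affine coordinates and an image basis."""
    a, b = coords
    return (s0[0] + a * (s1[0] - s0[0]) + b * (s2[0] - s0[0]),
            s0[1] + a * (s1[1] - s0[1]) + b * (s2[1] - s0[1]))


def evaluate_hypothesis(model, image, model_basis, image_basis, radius, theta):
    """Steps (1)-(5) for one correspondence; returns (score, accepted)."""
    m0, m1, m2 = (model[i] for i in model_basis)      # step (1)
    s0, s1, s2 = (image[i] for i in image_basis)
    candidates = [p for k, p in enumerate(image) if k not in image_basis]
    score = 0
    for idx, m in enumerate(model):
        if idx in model_basis:
            continue
        # steps (2)-(3): align the bases and project the remaining point
        u, v = project(affine_coords(m, m0, m1, m2), s0, s1, s2)
        # step (4): assumed scoring, one unit per projected point with a match
        if any(hypot(u - x, v - y) <= radius for x, y in candidates):
            score += 1
    return score, score >= theta                      # step (5)
```

An outer loop, whether exhaustive or guided by grouping, would call `evaluate_hypothesis` once per candidate correspondence.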

Correspondences can be tested exhaustively, or the outer algorithm can use more
global information (such as grouping) to guide the search towards correspondences
which are more likely to be correct. The actual manner through which the correspondences
are searched is not relevant to the functioning of the inner loop.

In the presentation of the algorithm, steps (4) and (5) are deliberately vague. In
particular, how do we tally up the score, and how do we set the threshold? The
answers to these two questions are linked to each other, and in order to answer them
we need to select:

• A weighting scheme, that is, when we project the model back into the image,
how we should weight any image points which fall near, but not exactly at, the
expected location of the other model points. The weighting scheme should be
determined by the model for sensor error.

• A method of accumulating evidence for a given hypothesis.

• A decision procedure, that is, how to set the threshold $\theta$, which is the score
needed to accept a hypothesis as being correct.

The first two choices determine the distributions of scores associated with correct
and incorrect hypotheses. Different choices can make the analytic derivation of these
distributions easier or harder; Chapter 5 will discuss some of these issues but for now
we present a single scheme for which we can do the analysis.

After  a  brief  presentation  of  the  mechanics  of  the  alignment  algorithm,  we  will  present 
the  error  assumptions  we  are  using  for  occlusion,  clutter,  and  sensor  noise,  and  how 
these  assumptions  affect  our  scoring  algorithm.  For  the  remainder  of  the  chapter  we 
will  present  a  particular  scoring  algorithm  for  hypotheses,  and  we  will  derive  the  score 
distributions  associated  with  correct  and  incorrect  hypotheses  as  a  function  of  the 
scoring  algorithm.  Once  we  know  these  distributions,  the  question  of  determining  the 
relationship  between  performance  and  the  threshold  used  for  acceptance  will  become 
straightforward. 

In our analysis we limit ourselves to the domain of planar objects in 3D poses. We
assume orthographic projection with scaling as our imaging model, and a Gaussian
error model, that is, the appearance in the image of any point arising from the model
is displaced by a vector drawn from a 2D circular Gaussian distribution. Because
much of the error analysis work in this domain has assumed a bounded uniform
model for sensor error, we will periodically refer to those results for the purpose of
comparison.


3.1 Projection Model

In this problem, our input is an image of a planar object with arbitrary 3D pose.
Under orthographic projection with scaling, we can represent the image location
$[u_i, v_i]^T$ of each model point $[x_i, y_i]^T$ with a simple linear transformation:

$$\begin{bmatrix} u_i \\ v_i \end{bmatrix} = \mathbf{T} \begin{bmatrix} x_i \\ y_i \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \end{bmatrix} \qquad (3.1)$$

where the transformation matrix $\mathbf{T}$ is a $2 \times 2$ non-singular matrix, and $[t_x, t_y]^T$ is a
translation vector. To easily see why this is so, note that when the model is planar,
the coordinate frame of the model can be chosen so that the third coordinate is always
0. In this case the 3D transformation collapses down to two dimensions:


$$s \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} r_{1,1} & r_{1,2} & r_{1,3} \\ r_{2,1} & r_{2,2} & r_{2,3} \\ r_{3,1} & r_{3,2} & r_{3,3} \end{bmatrix}
\begin{bmatrix} x_i \\ y_i \\ 0 \end{bmatrix}
+ \begin{bmatrix} t_x \\ t_y \\ 0 \end{bmatrix}
= \begin{bmatrix} s r_{1,1} x_i + s r_{1,2} y_i + t_x \\ s r_{2,1} x_i + s r_{2,2} y_i + t_y \\ 0 \end{bmatrix}$$

Here, $s$ is the scale factor, and the matrices are the orthographic projection matrix
and rotation matrix, respectively. Conversely, a three point correspondence between
model and image features uniquely determines both an affine transformation of the
model plane, and also a unique scale and pose for an object (up to a reflection over
a plane parallel to the image plane; see [Hut88]).
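
The collapse from the 3D pose to a 2D linear map can be checked numerically. The Python sketch below is an illustration only: the Euler-angle parameterization in `rotation_matrix` and all function names are assumptions made for the example, since the text does not commit to a particular rotation parameterization.

```python
from math import cos, sin


def rotation_matrix(ax, ay, az):
    """3D rotation R = Rz(az) * Ry(ay) * Rx(ax), as nested lists."""
    cx, sx = cos(ax), sin(ax)
    cy, sy = cos(ay), sin(ay)
    cz, sz = cos(az), sin(az)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def project_3d(point_xy, R, s, t):
    """Rotate the planar point (x, y, 0), drop the third coordinate
    (orthographic projection), scale by s, and translate by t."""
    x, y = point_xy
    z = 0.0
    rx = R[0][0] * x + R[0][1] * y + R[0][2] * z
    ry = R[1][0] * x + R[1][1] * y + R[1][2] * z
    return (s * rx + t[0], s * ry + t[1])


def project_2d(point_xy, R, s, t):
    """The same map written as a 2x2 linear transformation plus translation,
    using only the upper-left 2x2 block of R."""
    x, y = point_xy
    return (s * R[0][0] * x + s * R[0][1] * y + t[0],
            s * R[1][0] * x + s * R[1][1] * y + t[1])
```

For any angles, scale, and translation, `project_3d` and `project_2d` agree up to floating-point rounding, because the third column of the rotation matrix only ever multiplies the zero third coordinate of a planar model point.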


3.2  Image,  Model,  and  Affine  Reference  Frames 

Conceptually, there are three different coordinate frames we utilize during the analysis.
Model space is the global reference frame used for the model representation,
and image space is the global reference frame of the image. The transformation from
model space to image space is accomplished by the linear projection model discussed
above.

A  third  coordinate  frame,  called  affine  space,  is  used  for  each  correspondence  tested. 
This  coordinate  frame  is  established  by  the  three  model  points  used  in  the  initial 
correspondence  (which  must  not  be  collinear,  or  they  would  not  span  a  plane).  The 
ordered  triple  of  model  and  image  points  used  in  the  correspondence  is  referred  to 
as  the  model  basis  and  image  basis,  respectively.  Each  model  point  can  be  uniquely 
expressed  as  a  linear  function  of  the  model  basis: 

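$$\mathbf{m}_i = \mathbf{m}_0 + \alpha_i(\mathbf{m}_1 - \mathbf{m}_0) + \beta_i(\mathbf{m}_2 - \mathbf{m}_0) \qquad (3.2)$$

We can think of the vectors $(\mathbf{m}_1 - \mathbf{m}_0)$ and $(\mathbf{m}_2 - \mathbf{m}_0)$ as the unit basis vectors (1,0)
and (0,1) establishing the affine coordinate frame, in which $(\alpha_i, \beta_i)$ are the affine
coordinates of $\mathbf{m}_i$.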

We convert from model space coordinates to affine space coordinates as follows: given
model points $\mathbf{m}_0, \mathbf{m}_1, \mathbf{m}_2$ (in model space coordinates), the coordinates of a fourth
point $\mathbf{m}_i$ with respect to this basis are given by the expression

$$\begin{bmatrix} \alpha_i \\ \beta_i \end{bmatrix} = \ldots$$

in which

$$\mathbf{m}'_1 = \mathbf{m}_1 - \mathbf{m}_0 \qquad \mathbf{m}'_2 = \mathbf{m}_2 - \mathbf{m}_0 \qquad \mathbf{m}'_i = \mathbf{m}_i - \mathbf{m}_0$$
$$\phi = \angle\, \mathbf{m}'_1 \mathbf{m}'_2 \qquad \psi = \angle\, \mathbf{m}'_i \mathbf{m}'_1$$

and $\angle$ denotes the angle between the two vectors of the argument.

Figure 3-1: Calculating the affine coordinates of a fourth model point with respect to
three model points as basis.

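If we perform an affine transformation $\mathbf{T}$ plus a translation $\mathbf{t}$ such as in Equation 3.1
to both sides of Equation 3.2, we demonstrate that the affine space coordinates $(\alpha_i, \beta_i)$
of a model point $\mathbf{m}_i$ remain unchanged with respect to the transformed coordinates
of the basis:

$$\begin{aligned}
\mathbf{T}\mathbf{m}_i + \mathbf{t} &= \mathbf{T}[\mathbf{m}_0 + \alpha_i(\mathbf{m}_1 - \mathbf{m}_0) + \beta_i(\mathbf{m}_2 - \mathbf{m}_0)] + \mathbf{t} \\
&= \mathbf{T}\mathbf{m}_0 + \mathbf{t} + \mathbf{T}[\alpha_i(\mathbf{m}_1 - \mathbf{m}_0)] + \mathbf{T}[\beta_i(\mathbf{m}_2 - \mathbf{m}_0)] \\
&= \mathbf{T}\mathbf{m}_0 + \mathbf{t} + \alpha_i(\mathbf{T}\mathbf{m}_1 - \mathbf{T}\mathbf{m}_0) + \beta_i(\mathbf{T}\mathbf{m}_2 - \mathbf{T}\mathbf{m}_0) \\
&= [\mathbf{T}\mathbf{m}_0 + \mathbf{t}] + \alpha_i([\mathbf{T}\mathbf{m}_1 + \mathbf{t}] - [\mathbf{T}\mathbf{m}_0 + \mathbf{t}]) + \beta_i([\mathbf{T}\mathbf{m}_2 + \mathbf{t}] - [\mathbf{T}\mathbf{m}_0 + \mathbf{t}])
\end{aligned}$$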
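
The invariance just derived is easy to confirm numerically. In the Python sketch below, the affine coordinates are obtained by solving the 2x2 linear system implied by Equation 3.2 with Cramer's rule, which is an assumed route chosen for the example; the function names and the particular transformation in the usage note are likewise hypothetical.

```python
def affine_coords(p, b0, b1, b2):
    """(alpha, beta) such that p = b0 + alpha*(b1 - b0) + beta*(b2 - b0)."""
    ax, ay = b1[0] - b0[0], b1[1] - b0[1]
    bx, by = b2[0] - b0[0], b2[1] - b0[1]
    px, py = p[0] - b0[0], p[1] - b0[1]
    det = ax * by - ay * bx
    if abs(det) < 1e-12:
        raise ValueError("basis points are collinear")
    return (px * by - py * bx) / det, (ax * py - ay * px) / det


def apply_affine(T, t, p):
    """Apply the 2x2 matrix T and the translation t to the 2D point p."""
    return (T[0][0] * p[0] + T[0][1] * p[1] + t[0],
            T[1][0] * p[0] + T[1][1] * p[1] + t[1])
```

For instance, with basis points (0, 0), (2, 0), (0, 3) and fourth point (1, 1.5), `affine_coords` gives (0.5, 0.5). Mapping all four points through `apply_affine` with any non-singular `T` and recomputing yields the same pair up to rounding.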

This invariance of affine space coordinates under linear transformations (which we will
call “affine invariance”) gives us several advantages. First, we can find the
image location of projected model points without having to solve directly for the
transformation, since the image locations of all the model points can be expressed by
such a linear operation. Therefore, the image location of a projected model point $\mathbf{m}_i$
with affine coordinates $(\alpha_i, \beta_i)$ with respect to a given basis, once a correspondence
between model and image points has been established, is given by the expression

$$\mathbf{s}_i = \mathbf{s}_0 + \alpha_i(\mathbf{s}_1 - \mathbf{s}_0) + \beta_i(\mathbf{s}_2 - \mathbf{s}_0) \qquad (3.3)$$

where $\mathbf{s}_i$ denotes the $i$th image point. Second, since there is a one-to-one correspondence
between affine transformations and poses, the affine invariance of this representation
implies that it is pose invariant as well. This is the key to a particular form of
indexing, called geometric hashing. In particular, for the affine space established by
every model basis, the affine coordinates of every other model point are used as pose
invariant indices into a table in which the model basis is stored.

Because the affine representation is pose independent, it is the smallest model representation
for indexing (see discussion of relative and absolute axes, [CJ91]); any other
smallest representation must necessarily use coordinates which are functions of the
coordinates on the axes formed by the model basis. Because of this, and the recent
interest in affine coordinates in indexing and invariance, it is this representation that
we discuss and analyze in the rest of the work. Using this representation, we will be
able to apply the analysis to both alignment and geometric hashing.


3.3  Error  Assumptions 

We use the term “error” to describe any effect which causes an image of a model in
a known pose to deviate from what we expect, using our projection model to form
the image from the model. There are three kinds of error we will be assuming for the
analysis: occlusion, clutter, and sensor error.

Occlusion  occurs  as  a  result  of  some  part  of  the  object  in  the  scene  being  blocked, 
thereby  preventing  the  model  feature  from  appearing  in  the  image.  The  way  we  model 
this  process  is  to  assume  that  all  features  on  the  model  have  the  same  probability  c 
of  being  occluded,  and  that  the  occludedness  of  any  particular  feature  does  not  affect 
any  other.  Though  this  independence  assumption  is  probably  not  accurate,  it  is  often 
assumed  for  the  sake  of  simplicity. 

Any  image  point  which  does  not  arise  from  the  model  is  referred  to  as  clutter.  We 
assume  that  these  points  will  be  independently  and  uniformly  distributed  over  the 
image. 

Lastly, we refer to the difference between an image feature's observed and actual location
as “sensor error”. This displacement may arise due to artifacts of the imaging
or feature extraction process. We assume the same standard deviation of the sensed
error for all points from the same image, denoted by $\sigma_0$. The actual value of $\sigma_0$ will
depend on things such as lighting conditions, camera, and feature type used, and may
change from image to image. In the next section we will derive the effect this sensor
error has on the possible projected locations of model points in the image.


3.4  Deriving  the  Projected  Error  Distribution 

In this section we give an expression for the possible locations of the projected model
points as a function of error in the observed image locations of the basis points. Any
point which appears at one of these locations may have arisen from this hypothesis,
and may be counted in favor of its being correct. We use both a uniform bounded
error model and a Gaussian error model for the purpose of comparison.


3.4.1  Uniform  Bounded  Error 

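We are assuming that the sensed location of a point, $\mathbf{s}_i$, is displaced from its actual
location by a vector drawn from a uniform bounded distribution. Let us use $\hat{\mathbf{s}}_i$ to
denote the true location of the point, and $\mathbf{e}_i$ to denote the error vector. Therefore,
for every image point,

$$\mathbf{s}_i = \hat{\mathbf{s}}_i + \mathbf{e}_i.$$

Let us assume that $\{\mathbf{m}_0, \mathbf{m}_1, \mathbf{m}_2\} : \{\mathbf{s}_0, \mathbf{s}_1, \mathbf{s}_2\}$ is a correct image to model correspondence,
and let $(\alpha_i, \beta_i)$ be the affine space coordinates of a fourth model point $\mathbf{m}_i$.

Then the true image location of $\mathbf{m}_i$ is a function of true image locations of $\{\hat{\mathbf{s}}_0, \hat{\mathbf{s}}_1, \hat{\mathbf{s}}_2\}$:

$$\hat{\mathbf{s}}_i = (1 - \alpha_i - \beta_i)\hat{\mathbf{s}}_0 + \alpha_i \hat{\mathbf{s}}_1 + \beta_i \hat{\mathbf{s}}_2$$

However, the computed location of $\mathbf{m}_i$ is a function of the sensed locations of the image basis
points. We will denote the computed location as $\tilde{\mathbf{s}}_i$, and

$$\begin{aligned}
\tilde{\mathbf{s}}_i &= \mathbf{s}_0 + \alpha_i(\mathbf{s}_1 - \mathbf{s}_0) + \beta_i(\mathbf{s}_2 - \mathbf{s}_0) \\
&= (1 - \alpha_i - \beta_i)\mathbf{s}_0 + \alpha_i \mathbf{s}_1 + \beta_i \mathbf{s}_2
\end{aligned}$$

The expression for the displacement vector for the projected model point is given by
the difference between its computed and true location:

$$\begin{aligned}
\tilde{\mathbf{s}}_i - \hat{\mathbf{s}}_i &= (1 - \alpha_i - \beta_i)\mathbf{s}_0 + \alpha_i \mathbf{s}_1 + \beta_i \mathbf{s}_2 - \left[(1 - \alpha_i - \beta_i)\hat{\mathbf{s}}_0 + \alpha_i \hat{\mathbf{s}}_1 + \beta_i \hat{\mathbf{s}}_2\right] \\
&= (1 - \alpha_i - \beta_i)[\hat{\mathbf{s}}_0 + \mathbf{e}_0] + \alpha_i[\hat{\mathbf{s}}_1 + \mathbf{e}_1] + \beta_i[\hat{\mathbf{s}}_2 + \mathbf{e}_2] \\
&\quad - \left[(1 - \alpha_i - \beta_i)\hat{\mathbf{s}}_0 + \alpha_i \hat{\mathbf{s}}_1 + \beta_i \hat{\mathbf{s}}_2\right] \\
&= (1 - \alpha_i - \beta_i)\mathbf{e}_0 + \alpha_i \mathbf{e}_1 + \beta_i \mathbf{e}_2
\end{aligned}$$

When the error vectors $\mathbf{e}_i$ are drawn from a uniform circular distribution with radius
$\epsilon_0$, the vector given by this expression was shown in [GHJ91] to be distributed over a
disk with radius

$$\epsilon_0\left(|1 - \alpha_i - \beta_i| + |\alpha_i| + |\beta_i| + 1\right) \qquad (3.4)$$

(the final term accounts for the sensing error of the fourth image point itself).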
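
Equation 3.4 shows how the search disk grows as a model point moves away from the basis in affine coordinates. The short Python sketch below simply evaluates that radius; the function name and the sample values in the usage note are hypothetical.

```python
def bounded_error_radius(alpha, beta, eps0):
    """Radius of the disk in Equation 3.4 for a model point with affine
    coordinates (alpha, beta), given sensor error bounded by eps0."""
    return eps0 * (abs(1 - alpha - beta) + abs(alpha) + abs(beta) + 1)
```

For a point at the centroid of the basis triangle, `bounded_error_radius(1/3, 1/3, eps0)` is `2 * eps0`, the smallest value the expression can take. A point with affine coordinates (3, 3) gives `12 * eps0`, illustrating how quickly the uncertainty region expands outside the basis triangle.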


3.4.2  Gaussian  Error 

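For the 2D Gaussian error model, we will use the terminology $X \sim N(m, \sigma^2)$ to denote
that the random variable $X$ is normally distributed with mean $m$ and variance $\sigma^2$.
Also, $E[X]$ denotes the expected value of the random variable $X$. We assume a fixed
standard deviation $\sigma_0$ for the error distribution, and proceed as follows.

Let $\hat{\mathbf{s}}_i$ = true image location of model point $\mathbf{m}_i$:

$$\hat{\mathbf{s}}_i = (1 - \alpha_i - \beta_i)\hat{\mathbf{s}}_0 + \alpha_i \hat{\mathbf{s}}_1 + \beta_i \hat{\mathbf{s}}_2.$$

Let $\mathbf{s}_i$ = observed image location of $\mathbf{m}_i$:

$$\mathbf{s}_i = \hat{\mathbf{s}}_i + \mathbf{e}_i$$

where $\mathbf{e}_i \sim N(0, \sigma_0^2)$. Then $\mathbf{s}_i$ is a random variable, $\mathbf{s}_i \sim N(\hat{\mathbf{s}}_i, \sigma_0^2)$.
Let $\tilde{\mathbf{s}}_i$ = computed image location of $\mathbf{m}_i$:

$$\tilde{\mathbf{s}}_i = (1 - \alpha_i - \beta_i)\mathbf{s}_0 + \alpha_i \mathbf{s}_1 + \beta_i \mathbf{s}_2.$$

The projected error distribution in which we are interested is the difference between
the computed and observed image location of $\mathbf{m}_i$:

$$\Delta\mathbf{s}_i = \mathbf{s}_i - \tilde{\mathbf{s}}_i$$

$$\begin{aligned}
E[\Delta\mathbf{s}_i] &= E[\mathbf{s}_i] - E[\tilde{\mathbf{s}}_i] \\
&= E[\mathbf{s}_i] - \left((1 - \alpha_i - \beta_i)E[\mathbf{s}_0] + \alpha_i E[\mathbf{s}_1] + \beta_i E[\mathbf{s}_2]\right) \\
&= \hat{\mathbf{s}}_i - \left((1 - \alpha_i - \beta_i)\hat{\mathbf{s}}_0 + \alpha_i \hat{\mathbf{s}}_1 + \beta_i \hat{\mathbf{s}}_2\right) \\
&= 0
\end{aligned}$$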

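For the covariance matrix, we want to find the relation between any two random variables $\Delta s_i$ and $\Delta s_j$ (with $\hat{s}$ denoting true, $s$ observed, and $\tilde{s}$ computed locations):

\[
\begin{aligned}
\mathrm{Cov}(\Delta s_i, \Delta s_j)
&= E[\Delta s_i \Delta s_j^T] - E[\Delta s_i]\,E[\Delta s_j]^T \\
&= E[(s_i - \tilde{s}_i)(s_j - \tilde{s}_j)^T] - 0 \\
&= E\Big[\big(\hat{s}_i + e_i - \big((1 - \alpha_i - \beta_i)[\hat{s}_0 + e_0] + \alpha_i[\hat{s}_1 + e_1] + \beta_i[\hat{s}_2 + e_2]\big)\big) \\
&\qquad \cdot \big(\hat{s}_j + e_j - \big((1 - \alpha_j - \beta_j)[\hat{s}_0 + e_0] + \alpha_j[\hat{s}_1 + e_1] + \beta_j[\hat{s}_2 + e_2]\big)\big)^T\Big] \\
&= E\Big[\big(e_i - \big((1 - \alpha_i - \beta_i) e_0 + \alpha_i e_1 + \beta_i e_2\big)\big) \\
&\qquad \cdot \big(e_j - \big((1 - \alpha_j - \beta_j) e_0 + \alpha_j e_1 + \beta_j e_2\big)\big)^T\Big]
\end{aligned}
\]

Since all the $e_i$'s are independent, all cross terms $e_k e_l^T$ with $k \neq l$ disappear when we multiply and average, leaving

\[
E\left[(1 - \alpha_i - \beta_i)(1 - \alpha_j - \beta_j)\, e_0 e_0^T + \alpha_i \alpha_j\, e_1 e_1^T + \beta_i \beta_j\, e_2 e_2^T + e_i e_j^T\right] \tag{3.5}
\]

\[
= \begin{cases}
\left[(1 - \alpha_i - \beta_i)(1 - \alpha_j - \beta_j) + \alpha_i \alpha_j + \beta_i \beta_j\right] \sigma_0^2 I & i \neq j \\
\left[(1 - \alpha_i - \beta_i)^2 + \alpha_i^2 + \beta_i^2 + 1\right] \sigma_0^2 I & i = j
\end{cases} \tag{3.6}
\]

where $I$ is the identity matrix. The difference between the terms for $i = j$ and $i \neq j$ comes from the fact that in the former case, $E[e_i e_i^T] = \sigma_0^2 I$, while in the latter case $E[e_i e_j^T] = 0$. So, the distribution of the error vector for the $i$th point is a circular Gaussian with variance

\[ \sigma_e^2 = \sigma_0^2 \left[(1 - \alpha_i - \beta_i)^2 + \alpha_i^2 + \beta_i^2 + 1\right] \tag{3.7} \]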

The fact that the distribution has non-zero covariance (Equation 3.6) indicates that the error vectors for the different projected points are not independent. Since this dependence is difficult to take into consideration, we will assume that they are independent and proceed with the analysis. This assumption will cause us to underestimate the true variance of the score distribution for correct hypotheses, as we will see in a later section.
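The variance and covariance expressions for the projected Gaussian error can be sanity-checked by simulation. The sketch below is illustrative only: it assumes independent zero-mean Gaussian errors with standard deviation `sigma0` on the three basis points and on the two observed points, works along a single image axis (the covariance matrix is a multiple of the identity, so one axis suffices), and uses made-up affine coordinates. The function names are hypothetical.

```python
import random


def projected_variance(alpha, beta, sigma0):
    """Per-axis variance of the projected error for one model point."""
    return sigma0 ** 2 * ((1 - alpha - beta) ** 2 + alpha ** 2 + beta ** 2 + 1)


def projected_covariance(ai, bi, aj, bj, sigma0):
    """Per-axis covariance between the projected errors of two distinct points."""
    return sigma0 ** 2 * ((1 - ai - bi) * (1 - aj - bj) + ai * aj + bi * bj)


def simulate_second_moments(ai, bi, aj, bj, sigma0, trials=100_000, seed=0):
    """Estimate Var(ds_i) and Cov(ds_i, ds_j) along one image axis."""
    rng = random.Random(seed)
    sum_ii = 0.0
    sum_ij = 0.0
    for _ in range(trials):
        # Independent sensor errors: three basis points plus points i and j.
        e0, e1, e2, ei, ej = (rng.gauss(0.0, sigma0) for _ in range(5))
        di = ei - ((1 - ai - bi) * e0 + ai * e1 + bi * e2)
        dj = ej - ((1 - aj - bj) * e0 + aj * e1 + bj * e2)
        sum_ii += di * di
        sum_ij += di * dj
    # The displacements have zero mean, so raw second moments suffice.
    return sum_ii / trials, sum_ij / trials


var_est, cov_est = simulate_second_moments(0.5, 0.25, -0.3, 1.2, sigma0=2.0)
```

For these coordinates the closed forms give a variance of 5.5 and a covariance of 0.7; the simulated estimates should land close to those values, and the non-zero covariance is the dependence that the analysis chooses to ignore.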


3.5  Defining the Uniform and Gaussian Weight Disks

In  our  recognition  algorithm,  all  size  3  correspondences  are  searched  through  in  order 
to  find  a  good  pose.  Each  correspondence  between  model  and  image  points  is  used  to 
project  the  rest  of  the  model  points  into  the  image,  and  for  each  projected  model  point 
location,  if  an  image  point  appears  at  that  location,  this  is  counted  towards  a  total 
score  for  this  hypothesis.  If  after  checking  all  the  projected  model  point  locations  the 
total  score  exceeds  some  threshold,  this  hypothesis  is  accepted. 

When  a  correct  correspondence  is  tested  in  the  absence  of  any  error,  there  will  always 
be  an  image  point  at  the  exact  projected  location  of  every  model  point.  When  we 
take  sensor  error  into  account,  then  any  image  point  appearing  within  the  range  of 
the  projected  error  distribution  is  a  match  candidate  for  the  projected  model  point. 
It  is  clear  that  the  larger  the  distribution,  the  more  likely  it  is  that  a  random  image 
point  will  appear  within  its  range. 

In  all  previous  work  involving  analyzing  a  bounded  uniform  sensor  error  model,  the 
method  of  scoring  a  point  which  appears  inside  the  range  of  the  projected  error 
distribution  has  been  to  accord  it  a  full  vote.  Though  the  projected  error  distribution 
is  in  fact  not  uniform,  these  analyses  have  implicitly  treated  it  as  though  it  were,  by 
according  the  same  score  to  any  point  which  appears  inside  it. 

In  order  to  differentiate  the  scoring  method  from  the  error  model,  we  will  define  an 
entity  called  a  weight  disk,  whose  height  at  every  point  determines  the  score  of  an 
image  point  which  appears  at  that  location.  For  example,  the  weight  disk  implied  by 
the  scoring  scheme  just  mentioned  is  a  disk,  centered  at  the  projected  model  point 
location, with height 1 and radius given by Equation 3.4. This will be called the “uniform weight disk”. Though an optimal weight disk for a given error model may exist, it may also be difficult to derive or use. In general, we will speak of comparing weight disks, rather than error models, unless we are comparing the optimal weight disks for the error models involved.

We now define the weight disk which we will use to assign scores to points appearing in locations consistent with the projected error distributions, assuming a Gaussian model for sensor noise. Because the projected Gaussian distribution is unbounded, it could give rise to a point appearing anywhere in the image with non-zero probability. In practice though, we will ignore all points appearing outside a disk of radius $2\sigma$ from the center. The reason for this is to reduce run time and will become clear when we discuss geometric hashing. Because of this limitation, we will assign a value of 0 to points from the part of the distribution extending from $2\sigma$ to $\infty$:

\[ \int_0^{2\pi} \int_{2\sigma}^{\infty} \frac{1}{2\pi\sigma^2}\, e^{-r^2 / (2\sigma^2)}\, r \, dr \, d\theta = e^{-2} \approx 0.135 \]

That is, we will miss an image feature arising from a model point 13.5% of the time.

Next, for a point falling within the range of the truncated distribution, we will assign weights according to its proximity to the disk center. The weight is chosen to be:

\[ v = \frac{1}{2\pi\sigma^2}\, e^{-d^2 / (2\sigma^2)} \]

where $d$ = distance from the point's location to the disk center. This is the actual height of the 2D Gaussian distribution at the location where the image point appears. Again, this weighting is not optimal for this error model, and we will discuss different weighting schemes and their implications in Chapter 5. Therefore, the Gaussian weight disk is a Gaussian distribution, centered at the projected model location, and truncated at $2\sigma$ from the center.
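As a concrete illustration of the truncated Gaussian weight disk, the following sketch evaluates the weight of a point at distance `d` from the disk center and the probability mass lost to truncation. The function names and the `cutoff` parameter are hypothetical; the default cutoff of 2 standard deviations follows the text.

```python
import math


def gaussian_weight(d, sigma, cutoff=2.0):
    """Height of the 2D circular Gaussian at distance d, or 0 beyond cutoff * sigma."""
    if d > cutoff * sigma:
        return 0.0
    return math.exp(-d * d / (2.0 * sigma * sigma)) / (2.0 * math.pi * sigma * sigma)


def missed_fraction(cutoff=2.0):
    """Mass of a 2D circular Gaussian lying beyond cutoff standard deviations."""
    # Integrating (r / sigma^2) * exp(-r^2 / (2 sigma^2)) from cutoff * sigma
    # to infinity gives exp(-cutoff^2 / 2), independent of sigma.
    return math.exp(-cutoff ** 2 / 2.0)


peak = gaussian_weight(0.0, 1.0)   # weight at the disk center
edge = gaussian_weight(2.0, 1.0)   # weight at the truncation radius
outside = gaussian_weight(2.5, 1.0)
lost = missed_fraction()
```

With the cutoff at 2 standard deviations, `lost` evaluates to about 0.135, so roughly 86.5% of true image points fall inside their disk.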


Figure 3-2 illustrates the projected Gaussian and uniform weight disks. The figure shows that the Gaussian weight disks are smaller and more dense at the center than the uniform weight disks; this can also be seen by comparing the analytic expression for the radius of the uniform weight disk (Equation 3.4) against the radius of the Gaussian weight disk (where $\sigma$ is given in Equation 3.7):

\[ \epsilon_0 \left( |1 - \alpha_i - \beta_i| + |\alpha_i| + |\beta_i| + 1 \right) > 2\sigma_0 \sqrt{(1 - \alpha_i - \beta_i)^2 + \alpha_i^2 + \beta_i^2 + 1} \]

This inequality holds because of the triangle inequality. For the comparison, $\epsilon_0 = 2\sigma_0$.
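The radius comparison can be spot-checked over a grid of affine coordinates. This is only a sketch with hypothetical function names; it assumes the uniform error bound is matched to the Gaussian model by setting it to twice the Gaussian standard deviation, as in the comparison above.

```python
import math


def uniform_disk_radius(alpha, beta, eps0):
    """Radius of the uniform weight disk for affine coordinates (alpha, beta)."""
    return eps0 * (abs(1 - alpha - beta) + abs(alpha) + abs(beta) + 1)


def gaussian_disk_radius(alpha, beta, sigma0):
    """Radius (two standard deviations) of the Gaussian weight disk."""
    return 2.0 * sigma0 * math.sqrt(
        (1 - alpha - beta) ** 2 + alpha ** 2 + beta ** 2 + 1)


sigma0 = 1.0
eps0 = 2.0 * sigma0  # matched error scales for the comparison

# Affine coordinates on a coarse grid from -5 to 5 in steps of 0.25.
grid = [x / 4.0 for x in range(-20, 21)]
gaussian_always_smaller = all(
    gaussian_disk_radius(a, b, sigma0) < uniform_disk_radius(a, b, eps0)
    for a in grid for b in grid)
```

At the centroid of the basis (both coordinates equal to 1/3) the uniform disk has radius 4 while the Gaussian disk has radius of roughly 2.31, which is the smallest either disk can be.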


3.6  Scoring  Algorithm  with  Gaussian  Error 

The exact method of determining a score for a correspondence is given by the following algorithm:

(a) for a hypothesis $(m_0, m_1, m_2) : (s_0, s_1, s_2)$, find all Gaussian weight disk locations and sizes:

(i) find affine coordinates $m_j = (\alpha_j, \beta_j)$ with respect to basis $(m_0, m_1, m_2)$

(ii) projected image location for $m_j$ is $s_0 + \alpha_j (s_1 - s_0) + \beta_j (s_2 - s_0)$

(iii) projected Gaussian weight disk radius for $m_j$ is $2\sigma = 2\sigma_0 \left(\alpha_j^2 + \beta_j^2 + [1 - \alpha_j - \beta_j]^2 + 1\right)^{1/2}$

(b) for every image point $s_i$, initially set $d = \infty$.

(i) find the minimum distance $d$ between $s_i$ and the projected location of any $m_j$ such that $d < 2\sigma$.

(ii) add $v = \frac{1}{2\pi\sigma^2} e^{-d^2 / (2\sigma^2)}$ (the height of the Gaussian weight disk at the image point location) to the sum $w$, which is the total score for this hypothesis. If this image point did not come within $2\sigma$ of any projected model point, then $v = 0$.

Figure 3-2: The top and bottom figures show the location and density of the projected uniform and Gaussian weight disks, respectively. The darkness of the disk indicates the weight accorded a point which falls at that location. The three points used for the matching are the bottom tip of the fork and the ends of the two outer prongs. The image points found within the weight disks are indicated as small white dots. Note that the uniform disks are bigger and more diffuse than the Gaussian disks.
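The scoring steps above can be sketched in code as follows. This is a minimal illustration under stated assumptions rather than the original implementation: points are plain 2D tuples, the helper names are hypothetical, and where an image point lies inside several disks the sketch takes the nearest disk center among those whose own two-sigma radius contains the point, which is one reading of step (b)(i).

```python
import math


def affine_coords(p, b0, b1, b2):
    """Solve p = b0 + alpha * (b1 - b0) + beta * (b2 - b0) for (alpha, beta)."""
    ux, uy = b1[0] - b0[0], b1[1] - b0[1]
    vx, vy = b2[0] - b0[0], b2[1] - b0[1]
    det = ux * vy - uy * vx
    if abs(det) < 1e-12:
        raise ValueError("degenerate basis")
    px, py = p[0] - b0[0], p[1] - b0[1]
    return (px * vy - py * vx) / det, (ux * py - uy * px) / det


def score_hypothesis(model_basis, image_basis, model_points, image_points, sigma0):
    """Accumulate Gaussian weights of image points for one basis correspondence."""
    s0, s1, s2 = image_basis
    disks = []  # (center_x, center_y, sigma) for each projected model point
    for m in model_points:
        a, b = affine_coords(m, *model_basis)
        cx = s0[0] + a * (s1[0] - s0[0]) + b * (s2[0] - s0[0])
        cy = s0[1] + a * (s1[1] - s0[1]) + b * (s2[1] - s0[1])
        sigma = sigma0 * math.sqrt(a * a + b * b + (1 - a - b) ** 2 + 1)
        disks.append((cx, cy, sigma))

    total = 0.0
    for x, y in image_points:
        best = None  # (distance, sigma) of the nearest disk containing the point
        for cx, cy, sigma in disks:
            d = math.hypot(x - cx, y - cy)
            if d <= 2.0 * sigma and (best is None or d < best[0]):
                best = (d, sigma)
        if best is not None:
            d, sigma = best
            total += math.exp(-d * d / (2 * sigma * sigma)) / (2 * math.pi * sigma * sigma)
    return total


# Hypothetical example: one extra model point, one matching and one stray image point.
example_score = score_hypothesis(
    model_basis=[(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    image_basis=[(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)],
    model_points=[(1.0, 1.0)],
    image_points=[(10.0, 10.0), (50.0, 50.0)],
    sigma0=1.0)
```

In this example the model point has affine coordinates (1, 1), so its disk has a standard deviation of 2; the image point at the disk center contributes the peak weight and the stray point contributes nothing.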


The  collection  method  seems  somewhat  nonintuitive  in  that  we  accumulate  evidence 
from  every  image  point,  instead  of  taking  the  contribution  from  at  most  one  point  per 
projected  weight  disk.  The  reason  we  chose  to  associate  a  random  variable  with  each 
image  point,  rather  than  each  weight  disk,  is  that  it  is  difficult  to  work  with  sums 
of  random  variables  whose  density  function  involves  the  max  function.  In  Chapter  5 
we  will  examine  the  implications  of  accumulating  weight  from  at  most  a  single  image 
point  per  weight  disk. 

Now that we have selected a weighting scheme and a particular algorithm for accumulating scores for hypotheses, we can determine the score density associated with
correct  and  incorrect  hypotheses.  As  we  can  see  from  step  (b:ii)  in  the  algorithm,  the 
score  is  a  sum  over  the  individual  weights  from  the  n  image  points  (not  including  the 
three  used  for  the  basis  correspondence).  First  we  will  define  the  random  variables 
describing  the  score  contributions  from  the  individual  image  points. 

Suppose we are testing a correct hypothesis. Then a particular model point $m_i$ will give rise to an image point which falls within the projected Gaussian weight disk for model point $m_i$ 86.5% of the time, since the weight disk only extends to a radius of $2\sigma_e$. The weight that this image point yields using our weighting scheme can be described by a random variable which we will call $V_M$. For convenience we will refer to such an image point as a “true” image point. To demonstrate what this means, in the simpler bounded uniform error case and with $c$ denoting the probability of occlusion, the density of $V_M$ is:

\[ f_{V_M}(v) = c\,\delta(v) + (1 - c)\,\delta(v - 1) \]

where $\delta$ is the unit impulse function. This indicates the fact that when we are testing a true image point, it will always appear inside a projected weight disk and contribute a score of 1, unless it is occluded, in which case it contributes 0.

We also define the random variable $V_{\bar M}$ to describe the weight that a random point will yield for a tested hypothesis, where the term “random point” is taken to mean an image point which either does not arise from the model at all, or that does arise from the model, but the hypothesis being tested is incorrect. Note that we are assuming by this that a point which does arise from the model will contribute the same as a clutter point when the correspondence being tested is incorrect.

Next, we define the random variables $W_H$ and $W_{\bar H}$ to describe the cumulative weight of correct and incorrect hypotheses. We let $n + 3$ be the number of image points and $m + 3$ be the number of model points. Then $W_{\bar H}$ is defined as

\[ W_{\bar H} = \sum_{i=1}^{n} V_{\bar M} \]

where the $n$ in the sum is due to the fact that 3 image points are used in the basis correspondence. Note that when we are testing an incorrect hypothesis we consider all the image points to be random, whether or not the model appears in the image.

The expression for $W_H$ is slightly more complicated because of occlusion. If there were no occlusion, then $W_H$ would receive contributions from $m$ projected model points and $(n - m)$ clutter points, that is:

\[ W_H = \sum_{i=1}^{m} V_M + \sum_{i=1}^{n-m} V_{\bar M} \tag{3.9} \]

(3.10) 

When $c \neq 0$, we observe $n$ image points but we do not know how many of them arise from the model. The number of projected model points that we observe is actually a binomially distributed random variable $M$. Thus,

\[ W_H = \sum_{i=1}^{M} V_M + \sum_{i=1}^{n-M} V_{\bar M} \tag{3.11} \]

\[ P\{M = k\} = \binom{m}{k} (1 - c)^k c^{m-k} \tag{3.12} \]
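To make the role of occlusion concrete, the sketch below tabulates the distribution of the number of visible model points, assuming (as an interpretation of the binomial model) that each of the `m` non-basis model points is occluded independently with probability `c`. The function names are hypothetical.

```python
from math import comb


def visible_count_pmf(k, m, c):
    """P{M = k}: exactly k of m model points are visible, each occluded w.p. c."""
    return comb(m, k) * (1.0 - c) ** k * c ** (m - k)


def visible_count_moments(m, c):
    """Mean and variance of M computed directly from the pmf."""
    mean = sum(k * visible_count_pmf(k, m, c) for k in range(m + 1))
    second = sum(k * k * visible_count_pmf(k, m, c) for k in range(m + 1))
    return mean, second - mean ** 2


# Hypothetical setting: 10 non-basis model points, 30% occlusion.
mean_visible, var_visible = visible_count_moments(10, 0.3)
total_probability = sum(visible_count_pmf(k, 10, 0.3) for k in range(11))
```

Under this assumption the mean number of visible points is `(1 - c) * m` and the variance is `m * c * (1 - c)`, which are the moments needed when the cumulative score is approximated by a normal distribution.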

To discriminate between correct and incorrect hypotheses, we must know the score that a correct hypothesis is likely to have versus an incorrect one. For this we need to first determine the probability densities of the variables $V_M$ and $V_{\bar M}$, and subsequently the densities of $W_H$ and $W_{\bar H}$. The derivation of the densities of $V_M$ and $V_{\bar M}$ given a particular value for the size of the weight disk is straightforward; however, the size of the weight disk is itself dependent on the affine coordinates of the model points. We will define another random variable, $\sigma_e$, to describe the values of the standard deviation of the projected Gaussian error distribution, and we will discuss how we estimate its density in the next section. Once we have this expression, we will find the densities of $V_M$ and $V_{\bar M}$ by integrating the expressions:

\[ f_{V_M}(v) = \int_0^{\infty} f_{V_M \mid \sigma_e}(v \mid \sigma)\, f_{\sigma_e}(\sigma)\, d\sigma \]

\[ f_{V_{\bar M}}(v) = \int_0^{\infty} f_{V_{\bar M} \mid \sigma_e}(v \mid \sigma)\, f_{\sigma_e}(\sigma)\, d\sigma \]


3.6.1  Determining the Density of $\sigma_e$

The motivation for treating the weight disk radius as a random variable is that we would like to remove the dependence of $\sigma_e$ on the geometry of any particular model. Rather, we would like to find an expression for the probability density of the weight disk radius over all possible models. To do this, we estimate the density by generating thousands of models and keeping a histogram of the affine coordinates of all the model points in the affine space formed by randomly chosen model bases.

Specifically, the method is as follows: when a particular hypothesis is being evaluated, each model point is projected into the image with a weight disk whose radius is a function of the affine coordinates of the model point:

\[ 2\sigma = 2\sigma_0 \sqrt{(1 - \alpha_i - \beta_i)^2 + \alpha_i^2 + \beta_i^2 + 1} \]

in which $\sigma_0$, the standard deviation of the sensed Gaussian error, is a constant which must be determined empirically. We define a random variable $\sigma_e$ which takes on the values of $\sigma$ in the above expression, and in order to remove the dependence of $\sigma_e$ on the constant $\sigma_0$, we define another random variable

\[ \rho_e = \sqrt{(1 - \alpha_i - \beta_i)^2 + \alpha_i^2 + \beta_i^2 + 1} \tag{3.13} \]

and we set

\[ \sigma_e = \sigma_0 \rho_e \tag{3.14} \]

In the analysis we use two different probability densities for $\rho_e$, one for correct basis matchings and one for incorrect basis matchings. Intuitively, this is due to the fact that when incorrect basis matchings are tested, more often than not the projected model points fall outside the image range and are thrown away, while when correct hypotheses are tested the remaining model points always project to within the image. In tests we have observed that over half of incorrect hypotheses tested are rejected for this reason, leading to an altered density for $\rho_e$.

Let us call the two densities $f_{\rho_e \mid H}$ and $f_{\rho_e \mid \bar H}$. We empirically estimate the former density by generating a random model of size 25, then for each ordered triple of model points as basis, we increment a histogram for the value of $\rho_e$ as a function of $\alpha$ and $\beta$ for all the other model points with respect to that basis. For the latter density, we generate a random model of size 4 and a random image, and histogram the values of $\rho_e$ for only those cases in which the initial basis matching causes the remaining model point to fall within the image. The densities for $\rho_e$ found in this manner have been observed to be surprisingly invariant over numbers of model points ranging from
I  to  30.  over  numbers  of  image  points  ranging  from  1  to  1000.  and  even  across  ranges 
of $\sigma_0$ differing by as much as 10 to 1 (using a fixed image size).

The  model  is  constrained  such  that  the  maximum  distance  between  any  two  model 
points  is  not  greater  than  10  times  the  minimum  distance,  and  in  the  basis  selection, 
no  basis  is  chosen  such  that  the  angle  t*  between  the  two  axes  is  0  <1  f  1<  T5  or 
I^Tr  <  c’  <  This  is  done  to  avoid  unstable  bases,  thereby  bounding  the  size 

of the affine coordinates. For example, the coordinates of the point $P = (1, 1)$ with respect to the bases $(1, 0)$ and $(-1, 0)$ are $(\infty, \infty)$. A similar problem exists for the same point $P$ with respect to the bases $(1, 0)$ and $(0, 0)$. The minimum value of $\rho_e$ is found analytically by minimizing Equation 3.13 with respect to $\alpha$ and $\beta$, and occurs at $\alpha = \beta = \frac{1}{3}$, where $\rho_e = \left(\frac{4}{3}\right)^{1/2}$. The maximum value of $\rho_e$ occurs at the boundary conditions discussed above, and was determined empirically to be $\approx 40$.
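The lower bound on the weight disk scale factor, and the histogram-style sampling used to estimate its density, can both be illustrated with a short script. This is a sketch under simplifying assumptions: model points are drawn uniformly from the unit square, bases are random triples, and the constraints on point spacing and basis angle described above are not enforced (only exactly degenerate bases are skipped). All function names are hypothetical.

```python
import math
import random


def rho_e(alpha, beta):
    """Scale factor of the projected error for affine coordinates (alpha, beta)."""
    return math.sqrt((1 - alpha - beta) ** 2 + alpha ** 2 + beta ** 2 + 1)


def grid_minimum(lo=-1.0, hi=1.0, steps=601):
    """Brute-force search for the minimum of rho_e on a regular grid."""
    best = (float("inf"), None, None)
    for i in range(steps):
        a = lo + (hi - lo) * i / (steps - 1)
        for j in range(steps):
            b = lo + (hi - lo) * j / (steps - 1)
            r = rho_e(a, b)
            if r < best[0]:
                best = (r, a, b)
    return best


def affine_coords(p, b0, b1, b2):
    """Affine coordinates of p in basis (b0, b1, b2), or None if degenerate."""
    ux, uy = b1[0] - b0[0], b1[1] - b0[1]
    vx, vy = b2[0] - b0[0], b2[1] - b0[1]
    det = ux * vy - uy * vx
    if abs(det) < 1e-9:
        return None
    px, py = p[0] - b0[0], p[1] - b0[1]
    return (px * vy - py * vx) / det, (ux * py - uy * px) / det


def sample_rho(num_points=25, num_bases=200, seed=0):
    """Collect rho_e values for the non-basis points of one random model."""
    rng = random.Random(seed)
    model = [(rng.random(), rng.random()) for _ in range(num_points)]
    values = []
    for _ in range(num_bases):
        basis = rng.sample(range(num_points), 3)
        for k, p in enumerate(model):
            if k in basis:
                continue
            coords = affine_coords(p, *(model[b] for b in basis))
            if coords is not None:
                values.append(rho_e(*coords))
    return values


min_rho, min_alpha, min_beta = grid_minimum()
rho_samples = sample_rho()
```

The grid search recovers the analytic minimum at equal coordinates of 1/3, and no sampled value falls below it; binning `rho_samples` would give a rough empirical density of the kind described in this section, although without the basis constraints its tail will be heavier.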

The results were almost identical in every test we ran; two typical normalized histograms are shown in Figure 3-3. The histograms very closely fit the curves

\[ f_{\rho_e \mid H}(\rho) = a_1 \rho^{-2}, \qquad a_1 = 1.189 \]

and 

fpAH^P)  =  =  -1.624 

between the ranges $r_1 = \sqrt{4/3}$ and $r_2 = 40$. Note that the actual value for $r_2$ is not crucial, given the actual density functions; in fact, the difference in the analysis using $r_2 = 40$ or $r_2 = \infty$ turns out to be very small. Figure 3-3 shows the estimated density functions for $\rho_e$ superimposed on the empirical density functions. The integrals of the analytic expressions thus defined are 1.018 and 1.052, respectively.

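Using Equation 3.14, the density of $\rho_e$ implies the density of $\sigma_e$:

\[ f_{\sigma_e \mid H}(\sigma) = \frac{1}{\sigma_0}\, f_{\rho_e \mid H}\!\left(\frac{\sigma}{\sigma_0}\right) = b_1 \sigma^{-2}, \qquad b_1 = a_1 \sigma_0 \]

and similarly for $f_{\sigma_e \mid \bar H}(\sigma)$ with coefficient $b_0$, between the ranges $s_1 = \sigma_0 r_1$ and $s_2 = \sigma_0 r_2$. For the rest of the work we will work with the variable $\sigma_e$ rather than $\rho_e$ for convenience, keeping in mind that in the final analysis, the terms $a_0$, $a_1$, $r_1$ and $r_2$ are constants, and the terms $b_0$, $b_1$, $s_1$ and $s_2$ are variables dependent on them and on the value of $\sigma_0$.

Figure 3-3: The density functions $f_{\rho_e \mid H}(\rho)$ and $f_{\rho_e \mid \bar H}(\rho)$, respectively.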

3.6.2  Determining the Average Covariance of Two Projected Error Distributions

We have already derived the covariance between two projected error distributions, once the model to image correspondence has been fixed. This was shown to be

\[ \mathrm{Cov}(\Delta s_i, \Delta s_j) = \left[(1 - \alpha_i - \beta_i)(1 - \alpha_j - \beta_j) + \alpha_i \alpha_j + \beta_i \beta_j\right] \sigma_0^2 I \]

The probability density of the expression $(1 - \alpha_i - \beta_i)(1 - \alpha_j - \beta_j) + \alpha_i \alpha_j + \beta_i \beta_j$ can be estimated in a similar manner as in the previous section to determine the average covariance between projected error distributions. The actual experiment performed was: for 1000 randomly generated models, subject to the same constraints as in the previous section, 25 random model bases were chosen (again, subject to the same constraints as in the previous section). For each basis, 25 random model pairs were tested for the value of the covariance, and the result histogrammed. The results indicated that the average covariance was always positive. The implication of the average covariance being positive is that our estimate of the variance of $W_H$ will be too low, as we will observe.
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3.7  Deriving  the  Single  Point  Densities 

3.7.1  Finding $f_{V_M}(v)$

Given a correct hypothesis and no occlusion, the possible locations of a projected model point $m_i$ can be modeled as a vector $s_i + e_i$, where $e_i = [R, \Theta]^T$. Let us treat the vector as a pair of random variables $R$ and $\Theta$ to avoid defining new notation. Then $e_i$ given a fixed weight disk size is a displacement vector with Gaussian distribution (expressed in polar coordinates)

\[ f_{R,\Theta \mid \sigma_e}(r, \theta \mid \sigma) = \frac{r}{2\pi\sigma^2}\, e^{-r^2 / (2\sigma^2)} \]

We now choose an evaluation function $g(r, \theta)$, which we use to weight a match that is offset by $e_i$ from the predicted match location. We want to find its density, i.e., we want $f_{g(R,\Theta) \mid \sigma_e}(v)$, where the joint density of $R$ and $\Theta$ is as stated. As mentioned, we choose the evaluation function

\[ g(r, \theta) = \frac{1}{2\pi\sigma^2}\, e^{-r^2 / (2\sigma^2)} \]

We are assuming that the value of $\sigma$ is fixed without actually denoting this in the function $g$. Since the evaluation function is really a function of $r$ alone, we need to know the density function of $r$. To find this, we integrate $f_{R,\Theta \mid \sigma_e}(r, \theta \mid \sigma)$ over $\theta$:

\[ f_{R \mid \sigma_e}(r \mid \sigma) = \int_0^{2\pi} f_{R,\Theta \mid \sigma_e}(r, \theta \mid \sigma)\, d\theta = \frac{r}{\sigma^2}\, e^{-r^2 / (2\sigma^2)} \]

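Next, we want to find the density of the weight function $v = g(R)$. The change of variables formula for a monotonically decreasing function is given by Equation A.1, restated here:

\[ f_{g(R)}(v) = \frac{f_R(g^{-1}(v))}{\left| g'(g^{-1}(v)) \right|} \]

Working through the steps, with $g(r) = \frac{1}{2\pi\sigma^2} e^{-r^2/(2\sigma^2)}$, we find

\[
\begin{aligned}
g'(r) &= -\frac{r}{\sigma^2}\, g(r) \\
f_{R \mid \sigma_e}(r \mid \sigma) &= \frac{r}{\sigma^2}\, e^{-r^2 / (2\sigma^2)} = 2\pi r\, g(r) \\
f_{R \mid \sigma_e}(g^{-1}(v) \mid \sigma) &= 2\pi\, g^{-1}(v)\, g(g^{-1}(v)) = 2\pi v\, g^{-1}(v) \\
g'(g^{-1}(v)) &= -\frac{g^{-1}(v)}{\sigma^2}\, g(g^{-1}(v)) = -\frac{g^{-1}(v)}{\sigma^2}\, v \\
f_{g(R) \mid \sigma_e}(v \mid \sigma) &= 2\pi\sigma^2
\end{aligned}
\]

It may seem counterintuitive that the resulting distribution is constant. However, this can be understood if one considers an example in which $f_{R,\Theta}(r, \theta)$ is uniformly distributed. Integrating over all angles yields a linearly increasing function in $r$. Assigning an evaluation function $g(r, \theta)$ which is inversely proportional to $r$ yields a constant density function on $f_{g(R) \mid \sigma_e}(v \mid \sigma)$. The same thing is happening here, only quadratically. Since we only score a point if it falls within a radius of $2\sigma$ from the center, we miss the entire part of the distribution from a radius of $2\sigma$ to $\infty$, which as we showed before is $e^{-2}$. So the probability density of $V_M$ given a fixed sized weight disk is:

\[
f_{V_M \mid \sigma_e}(v \mid \sigma) =
\begin{cases}
e^{-2}\, \delta(v) & v = 0 \\
2\pi\sigma^2 & \dfrac{1}{2\pi\sigma^2 e^2} \le v \le \dfrac{1}{2\pi\sigma^2} \\
0 & \text{otherwise}
\end{cases}
\]

This expression correctly integrates to 1.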
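The claim that the weight of a true image point is uniformly distributed over its allowed range, with an impulse at zero of mass equal to the truncated tail, can be checked by simulation. The sketch below is illustrative only: it fixes the projected standard deviation `sigma`, samples 2D circular Gaussian displacements, and bins the non-zero weights into five equal-width bins. The function name and bin count are arbitrary choices.

```python
import math
import random


def sample_true_point_weights(sigma, trials=50_000, seed=0):
    """Weights earned by simulated true image points for a fixed disk size."""
    rng = random.Random(seed)
    weights = []
    for _ in range(trials):
        x = rng.gauss(0.0, sigma)
        y = rng.gauss(0.0, sigma)
        d2 = x * x + y * y
        if d2 > 4.0 * sigma * sigma:
            weights.append(0.0)  # outside the 2-sigma disk: no score
        else:
            weights.append(math.exp(-d2 / (2 * sigma * sigma))
                           / (2 * math.pi * sigma * sigma))
    return weights


sigma = 2.5
weights = sample_true_point_weights(sigma)
zero_fraction = sum(1 for w in weights if w == 0.0) / len(weights)

# Non-zero weights lie between v_min (disk edge) and v_max (disk center).
v_max = 1.0 / (2 * math.pi * sigma * sigma)
v_min = v_max * math.exp(-2.0)
nonzero = [w for w in weights if w > 0.0]
bin_counts = [0] * 5
for w in nonzero:
    index = min(4, int((w - v_min) / (v_max - v_min) * 5))
    bin_counts[index] += 1
bin_fractions = [count / len(nonzero) for count in bin_counts]
```

If the density is constant on the interval, each of the five bins should hold about one fifth of the non-zero weights, and `zero_fraction` should be close to 0.135.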

We need to integrate this expression over all values of $\sigma_e$. Dealing first with the case where $v \neq 0$, we get:

\[ f_{V_M}(v) = \int f_{V_M \mid \sigma_e}(v \mid \sigma)\, f_{\sigma_e \mid H}(\sigma)\, d\sigma = \int 2\pi\sigma^2\, b_1 \sigma^{-2}\, d\sigma \]

There are two things to take into consideration when calculating the limits for the integration: first, the possible values of $\sigma_e$ range from a lower limit $s_1$ to an upper limit $s_2$, due to limits on the values of the affine coordinates. Also, for a given $\sigma_e = \sigma$, it is clear that the maximum value we can achieve is when $r = 0 \Rightarrow v = \frac{1}{2\pi\sigma^2}$. The minimum value we can achieve is at the cutoff point $r = 2\sigma \Rightarrow v = \frac{1}{2\pi\sigma^2 e^2}$. Setting $v$ to each of these expressions and solving for $\sigma$ leads to the conclusion that for a particular value $v$, the only values for $\sigma_e$ such that $g(e_i \mid \sigma_e)$ could equal $v$ are in the range $\left(\frac{1}{e\sqrt{2\pi v}}, \frac{1}{\sqrt{2\pi v}}\right)$. Therefore the lower bound on the integral is $\sigma = \max\left(s_1, \frac{1}{e\sqrt{2\pi v}}\right)$, and the upper bound is $\sigma = \min\left(\frac{1}{\sqrt{2\pi v}}, s_2\right)$.

The bounds over which the integration is performed are illustrated in Figure 3-4. The third dimension of this graph (not shown) is the joint density function $f_{V_M, \sigma_e}(v, \sigma)$. Conceptually what we are doing is integrating over the $\sigma$ axis. We split this integral into the three regions defined by the integration bounds, and deal with the case $v = 0$ separately. Integrating, we get:

\[
f_{V_M}(v) =
\begin{cases}
e^{-2}\, \delta(v) & v = 0 \\
2\pi b_1 \left(s_2 - \dfrac{1}{e\sqrt{2\pi v}}\right) & \ell_1 < v < \ell_2 \\
\dfrac{b_1 (e - 1)\sqrt{2\pi}}{e \sqrt{v}} & \ell_2 < v < \ell_3 \\
2\pi b_1 \left(\dfrac{1}{\sqrt{2\pi v}} - s_1\right) & \ell_3 < v < \ell_4 \\
0 & \text{otherwise}
\end{cases}
\]

where

\[ \ell_1 = \frac{1}{2\pi s_2^2 e^2}, \qquad \ell_2 = \frac{1}{2\pi s_2^2}, \qquad \ell_3 = \frac{1}{2\pi s_1^2 e^2}, \qquad \ell_4 = \frac{1}{2\pi s_1^2} \]

and $s_1, s_2$ are the minimum and maximum allowable values for $\sigma_e$, respectively. This expression is graphed on the left in Figure 3-5 for $\sigma_0 = 2.5$. We break the equation into cases not out of necessity, but for legibility.

Figure 3-4: The figure shows the boundaries for the integration for both $f_{V_M}(v)$ and $f_{V_{\bar M}}(v \mid m = 1)$. The bottom curve is $\sigma = \frac{1}{e\sqrt{2\pi v}}$ and the upper curve is $\sigma = \frac{1}{\sqrt{2\pi v}}$. The third dimension of the graph (not illustrated) are the joint density functions $f_{V_M, \sigma_e}(v, \sigma)$ and $f_{V_{\bar M}, \sigma_e}(v, \sigma \mid m = 1)$.

3.7.2  Finding $f_{V_{\bar M}}(v \mid m = 1)$

We do the same derivation for the distribution $f_{V_{\bar M}}(v \mid m = 1)$, that is, the value that a random point contributes to the basis of a model with 4 points. This is a prerequisite for finding the distribution for the case where the model has $m + 3$ points. Given a hypothesis and a random point, we calculate the distribution as follows: let event $E$ = “point falls in a single hypothesized weight disk”. Given $\sigma_e = \sigma$, the probability of event $E$ equals the area of the weight disk divided by the area of the image $A$, i.e.,

\[ P\{E \mid \sigma_e = \sigma\} = \frac{\pi (2\sigma)^2}{A} = \frac{4\pi\sigma^2}{A} \]

Now we calculate the probability that a point which is uniformly distributed inside a disk of radius $2\sigma$ contributes value $v$ for an incorrect hypothesis, using the weighting function defined in the previous section. As before, we express a uniform distribution as a pair of random variables $R, \Theta$, and then integrate over $\Theta$ to get the density of $R$ alone, since the evaluation function $g$ is a function of $R$:

\[ f_{R \mid \sigma_e}(r \mid \sigma) = \frac{r}{2\sigma^2}, \qquad 0 \le r \le 2\sigma \]

As before, we calculate the density $f_{g(R) \mid \sigma_e}$ given that the clutter point falls in the weight disk with the new distribution for $R$ and get:

\[ f_{g(R) \mid \sigma_e, E}(v \mid \sigma, E) = \frac{f_R(g^{-1}(v))}{\left| g'(g^{-1}(v)) \right|} = \frac{1}{2v} \]

Therefore, the density of $V_{\bar M}$ given a single weight disk with fixed $\sigma_e$ is:

\[
f_{V_{\bar M} \mid \sigma_e}(v \mid \sigma, m = 1) =
\begin{cases}
P\{\bar{E} \mid \sigma_e = \sigma\}\, \delta(v) = \left[1 - \dfrac{4\pi\sigma^2}{A}\right] \delta(v) & v = 0 \\
f_{g(R) \mid \sigma_e, E}(v \mid \sigma, E)\, P\{E \mid \sigma_e = \sigma\} = \dfrac{2\pi\sigma^2}{A v} & \dfrac{1}{2\pi\sigma^2 e^2} \le v \le \dfrac{1}{2\pi\sigma^2} \\
0 & \text{otherwise}
\end{cases}
\]

Again, this expression correctly integrates to 1. As before, we need to integrate over $\sigma_e$:

\[ f_{V_{\bar M}}(v \mid m = 1) = \int f_{V_{\bar M} \mid \sigma_e}(v \mid \sigma, m = 1)\, f_{\sigma_e}(\sigma)\, d\sigma \]


/(^) 


bo(T  *da 
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Dealing  with  r  =  0  as  a  separate  case,  and  with  the  same  bounds  as  before,  integrating 
yields: 


where 


fv—{  i' 

•'  A/ ' 


v=0 

-  l)v/27rc 

cn 

VI 

V 

fs  <  «’  <  (4 

^0 

otherwise 

———  (  = 

^  ‘1'KS2^ 


f.T  — 


1 


‘iTTSi^C^ 


(*  = 


‘27rsi 


.2 


This function is illustrated on the right of Figure 3-5 for a value of $\sigma_0 = 2.5$.

We ran two simulations to verify the analysis of this section. In the first one, we tested the density function $f_{V_M}(v)$ as follows: we generated a random model of size 4, chose a random 3D pose and scale, projected the model into an image adding a Gaussian distributed displacement vector to each point, chose a correct image to model correspondence, and histogrammed the value of the fourth point. This was repeated 15,000 times. The second simulation differed only in that instead of projecting the model into the image, a random image was created and a random correspondence tested. The results of the simulations are also shown in Figure 3-6. Both graphs show a normalized histogram of the results of 15,000 independent trials excluding the value at $v = 0$. The measured density of $f_{V_M}(v)$ does not fit the prediction at $v = 0$ because of binning problems at that value, but the rest of the first graph indicates the empirical results corroborating the predictions very closely. For the second graph, most of the density occurs at $v = 0$; for the remainder of the distribution a chi-squared test shows no significant difference between the empirical and analytic distributions (probability = .98 for $\chi^2 = 160$ with 199 degrees of freedom).
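A much-reduced version of the second simulation can be sketched for a single weight disk of fixed size. The code below is not the experiment described above: it assumes a hypothetical 100 by 100 image, one disk centered in the image with a fixed standard deviation `sigma`, and clutter points drawn uniformly over the image. It compares the fraction of points landing in the disk with the ratio of disk area to image area, and the mean weight per clutter point with the Gaussian mass inside two standard deviations divided by the image area.

```python
import math
import random


def simulate_clutter(sigma, width, height, trials=200_000, seed=0):
    """Return (hit fraction, mean weight) for uniform clutter and one disk."""
    rng = random.Random(seed)
    cx, cy = width / 2.0, height / 2.0  # disk center, assumed inside the image
    hits = 0
    total_weight = 0.0
    for _ in range(trials):
        x = rng.uniform(0.0, width)
        y = rng.uniform(0.0, height)
        d2 = (x - cx) ** 2 + (y - cy) ** 2
        if d2 <= 4.0 * sigma * sigma:
            hits += 1
            total_weight += (math.exp(-d2 / (2 * sigma * sigma))
                             / (2 * math.pi * sigma * sigma))
    return hits / trials, total_weight / trials


sigma, width, height = 10.0, 100.0, 100.0
area = width * height
hit_fraction, mean_weight = simulate_clutter(sigma, width, height)

predicted_hit_fraction = 4.0 * math.pi * sigma ** 2 / area
predicted_mean_weight = (1.0 - math.exp(-2.0)) / area
```

One consequence worth noting is that the predicted mean weight of a clutter point does not depend on `sigma`: a larger disk is hit more often but assigns proportionally lower weights.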


3.8  Deriving  the  Accumulated  Densities 

Having found the single point densities, we use them to find the density of the combined weight of points for correct and incorrect hypotheses.

Figure 3-5: Distributions $f_{V_M}(v)$ and $f_{V_{\bar M}}(v \mid m = 1)$ for $v > 0$. For these distributions, $\sigma_0 = 2.5$. Note that the scale of the y axis in the first graph is approximately ten times greater than that of the second.

3.8.1  Finding $f_{W_H}(w)$

For a model of size $m + 3$ and an image of size $n + 3$, when a correct hypothesis is being tested, then there are $M$ true points in the image, and $n - M$ random points, where $M$ is binomially distributed and

\[ P\{M = k\} = \binom{m}{k} (1 - c)^k c^{m-k} \]

The weight we collect for this hypothesis is a random variable with probability density

\[
\begin{aligned}
f_{W_H}(w) &= \underbrace{f_{V_M}(v) \otimes \cdots \otimes f_{V_M}(v)}_{M} \otimes \underbrace{f_{V_{\bar M}}(v) \otimes \cdots \otimes f_{V_{\bar M}}(v)}_{n - M} \\
&= \bigotimes_{i=1}^{M} f_{V_M}(v) \otimes \bigotimes_{i=1}^{n-M} f_{V_{\bar M}}(v)
\end{aligned}
\]

where $\otimes$ denotes convolution. The above shortened notation will be used from now on for convenience. This formula assumes that each point contributes weight to its supporting basis independently of any other.

In order to avoid explicitly convolving the preceding distributions, we find the expected value and the standard deviation of $V_M$ and $V_{\bar M}$, and invoke the central limit theorem to claim that the combined weight of a correct correspondence between a size $m + 3$ model in a size $n + 3$ image should roughly follow a normal distribution. The fact that $M$ is binomially distributed when $c \neq 0$ means that this distribution will not be completely Gaussian, but we will assume that it is for simplicity. Again, $N(m, \sigma^2)$ denotes a normal distribution with mean $m$ and variance $\sigma^2$:

\[ W_H \sim N\!\left( E\!\left[\sum_{i=1}^{M} V_M + \sum_{i=1}^{n-M} V_{\bar M}\right],\ \mathrm{Var}\!\left(\sum_{i=1}^{M} V_M + \sum_{i=1}^{n-M} V_{\bar M}\right) \right) \]

Lastly, we use the formulas for conditional mean and variance, given by Equations A.2 and A.3, restated here, to simplify the above expression:

\[ E[X] = E[E[X \mid Y]] \]

\[ \mathrm{Var}(X) = E[\mathrm{Var}(X \mid Y)] + \mathrm{Var}(E[X \mid Y]) \]

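Solving in stages, we find:

$$E\left[\sum_{i=1}^{M} V_M + \sum_{i=1}^{n-M} V_{\overline{M}} \,\Big|\, M\right] = M\,E[V_M] + (n - M)\,E[V_{\overline{M}}]$$

$$\mathrm{Var}\left(\sum_{i=1}^{M} V_M + \sum_{i=1}^{n-M} V_{\overline{M}} \,\Big|\, M\right) = M\,\mathrm{Var}(V_M) + (n - M)\,\mathrm{Var}(V_{\overline{M}})$$

$$E\left[E\left[\sum_{i=1}^{M} V_M + \sum_{i=1}^{n-M} V_{\overline{M}} \,\Big|\, M\right]\right] = E[M]\,E[V_M] + E[n - M]\,E[V_{\overline{M}}]$$

$$E\left[\mathrm{Var}\left(\sum_{i=1}^{M} V_M + \sum_{i=1}^{n-M} V_{\overline{M}} \,\Big|\, M\right)\right] = E[M]\,\mathrm{Var}(V_M) + E[n - M]\,\mathrm{Var}(V_{\overline{M}})$$

$$\mathrm{Var}\left(E\left[\sum_{i=1}^{M} V_M + \sum_{i=1}^{n-M} V_{\overline{M}} \,\Big|\, M\right]\right) = \mathrm{Var}(M)\,E[V_M]^2 + \mathrm{Var}(n - M)\,E[V_{\overline{M}}]^2$$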


We use the values $E[M] = (1 - c)m$ and $\mathrm{Var}(M) = mc(1 - c)$, and use the above formulas to find that:

$$W_H \sim N(A, B)$$

in which

$$A = (1 - c)m\,E[V_M] + [n - (1 - c)m]\,E[V_{\overline{M}}]$$

$$B = (1 - c)m\,\mathrm{Var}(V_M) + mc(1 - c)\,E[V_M]^2 + [n - (1 - c)m]\,\mathrm{Var}(V_{\overline{M}}) + mc(1 - c)\,E[V_{\overline{M}}]^2$$

Solving for the remaining terms, we find

$$E[V_M] = \int_0^{\ell_4} v\, f_{V_M}(v)\, dv$$

Integrating over the four regions of the distribution and using the equalities

$$\ell_1 = \frac{1}{2\pi s_2^2 e^2} \qquad \ell_2 = \frac{1}{2\pi s_2^2} \qquad \ell_3 = \frac{1}{2\pi s_1^2 e^2} \qquad \ell_4 = \frac{1}{2\pi s_1^2}$$

yields


E[Vm] 


61 


e"*  —  1 
I2ire* 


Further  substitutions  of  the  equalities 

bi  =  ai(To  Si  =  <Tori  S2  = 


e‘*  —  1 


1  1 


from  Section  3.6.1  yield 

Finally, substituting the values of $a_1 = 1.189$, $r_1$, and $r_2 = 40$, also from Section 3.6.1, yields

$$E[V_M] = 2.01 \times 10^{-2} \times \frac{1}{\sigma_0^2}$$


The  remaining  terms  are  found  using  the  same  steps: 


Var(VM) 


61 


e®-  1 
607r2e6 


e®  —  1 
^^OO^r^cToe® 


$$E[V_M^2] = 9.76 \times 10^{-4} \times \frac{1}{\sigma_0^4}$$

$$\mathrm{Var}(V_M) = E[V_M^2] - E[V_M]^2$$


Note that the value of the limit $r_2$ was determined empirically and is a function of the constraints on the bases that were chosen. Without the basis constraints, $r_2$ tends to infinity, and in fact the values of these parameters for $r_2 = 40$ and $r_2 = \infty$ are not significantly different.

The values of $E[V_{\overline{M}}]$ and $\mathrm{Var}(V_{\overline{M}})$ are derived in the next section.
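The normal approximation for a correct hypothesis can be evaluated numerically. The following is a minimal sketch, not the author's code: the function name and argument names are hypothetical, the expressions for A and B are transcribed as stated in this section, and the constants 2.01e-2 and 9.76e-4 are read from the text (the exponent of the second is partly illegible in the source and is assumed here because it agrees with the predicted variance in Table 3.1).

```python
def correct_hypothesis_moments(m, n, c, ev_true, var_true, ev_clutter, var_clutter):
    """Mean A and variance B of W_H under the normal approximation.

    ev_true / var_true are E[V_M] and Var(V_M); ev_clutter / var_clutter are
    the corresponding moments of the clutter-point weight.
    """
    exp_true = (1.0 - c) * m        # E[M]
    var_count = m * c * (1.0 - c)   # Var(M), which equals Var(n - M)
    mean = exp_true * ev_true + (n - exp_true) * ev_clutter
    var = (exp_true * var_true + var_count * ev_true ** 2
           + (n - exp_true) * var_clutter + var_count * ev_clutter ** 2)
    return mean, var


sigma0 = 2.5
ev_true = 2.01e-2 / sigma0 ** 2                    # E[V_M]
var_true = 9.76e-4 / sigma0 ** 4 - ev_true ** 2    # E[V_M^2] - E[V_M]^2

# With no occlusion and m = n there are no clutter points, so the clutter
# moments do not matter and are passed as placeholders.
mean, var = correct_hypothesis_moments(10, 10, 0.0, ev_true, var_true, 0.0, 0.0)
print(mean, var)
```

For m = n = 10, c = 0 and sigma0 = 2.5 this gives values within about one percent of the predicted mean and variance listed for that row of Table 3.1.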
3.8.2 Finding $f_{W_{\overline{H}}}(w)$

For an incorrect hypothesis we look at the problem in two steps. First we derive, as above, the mean and standard deviation of $V_{\overline{M}} \mid m = 1$, i.e., the weight of a single random image point that drops into a single weight disk. From the distribution $f_{V_{\overline{M}}}(v \mid m = 1)$, we find:

Aj 

E[V]g|m  =  l]  =  r/v'_(r  I  771  =  l)dr 

i1 

3cM  [s?  sl\ 

Substituting  bo  =  gq/Jq  from  Section  3.6.1,  we  get: 


E[V^  I  777  =  1] 


ao(e^-  1)  h  _  r 

3e^yl  Tj  rl 


Lastly,  we  note  from  Section  3.6.1  that 


since  this  is  exactly  the  integral  of  the  density  /  , _ Therefore, 

Pe|H 


E[%  I  m  =  1| 


We  continue  with  E  |  tt?  =  1  : 

E[V^1t77  =  i]  =  [  v'^fv-{V  \m  =  \)dv 

20e'*  Att  Sj 

ao(e^-  1)  1  _  J_ 

20e‘‘i47rc7-Q  rf 

Substituting  gq  =  4.624,  rj  =  and  r2  =  40,  from  Section  3.6.1: 

E[Vg- 1  m  =  l]  =  3.52  x  x  -L 
$$\mathrm{Var}(V_{\overline{M}} \mid m = 1) = E[V_{\overline{M}}^2 \mid m = 1] - E[V_{\overline{M}} \mid m = 1]^2$$

Note that the mean $E[V_{\overline{M}} \mid m = 1]$ is not dependent on the value of $\sigma_0$.
Now, consider a single random image point (i.e., n = 4; three for the hypothesis and one left over) dropped into an image where a model of size m + 3 > 4 is hypothesized to be. In this case the event that the random point will contribute weight v to this hypothesis is calculated as follows: Let event $E_i$ = “point drops in the ith weight disk.” Then,


/v— (i’ 

•'a/' 


r  ^  0)  =  ^i)  +  £2)  +  •  •  •  +  ^m) 


where we are assuming the disks are disjoint, hence we are overestimating the probability of the point falling in any disk. The actual rate of detection will be lower than our assumption, especially as m grows large.


'  1  _  m4irbn  r  1 _ J_] 

A  Is,  sjJ 

lo 


v=0 

<  V  <  (2 
(2  <  V  <  (3 
C3<V<  (4 
otherwise 


As  m  grows  large,  (1  —  —  ^])  <  0  so  this  expression  is  no  longer  a  density 

function.  This  is  the  point  at  which  the  model  covers  so  much  of  the  image  that  a 
random  point  will  always  contribute  to  some  incorrect  hypothesis.  Therefore,  this 
analysis  only  applies  to  models  for  which 


A 


m  <  — r  * 
<^o 


4-02034 

<^o 


using  the  equalities  6o  =  ~  and  S2  =  cror2  from  Section  3.6.1.  For  a 

$\sqrt{A} : \sigma$ ratio of 200 : 1, m < 800, and for a ratio of 50 : 1, m < 50.

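The mean and variance for the weight of one random point dropping into an image with m weight disks are:

$$E[V_{\overline{M}}] = \int v\, f_{V_{\overline{M}}}(v)\, dv = m \int v\, f_{V_{\overline{M}}}(v \mid m = 1)\, dv = m\,E[V_{\overline{M}} \mid m = 1]$$

$$\mathrm{Var}(V_{\overline{M}}) = E[V_{\overline{M}}^2] - E[V_{\overline{M}}]^2 = m\,E[V_{\overline{M}}^2 \mid m = 1] - m^2\,E[V_{\overline{M}} \mid m = 1]^2$$

Dropping n points convolves this distribution with itself n times: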
$W_H$:

               Mean                              Variance
  m     n      Emp        Pred       Emp/Pred    Emp        Pred       Emp/Pred
  1     1      3.695E-3   3.218E-3   1.15        1.519E-5   1.462E-5   1.04
  1     100    3.838E-3   3.534E-3   1.09        1.735E-5   1.668E-5   1.04
  1     500    4.803E-3   4.812E-3   .998        2.227E-5   2.498E-5   .891
  5     5      1.966E-2   1.609E-2   1.22        1.493E-4   7.312E-5   2.04
  10    10     4.199E-2   3.218E-2   1.30        5.413E-4   1.462E-4   3.70
  10    100    4.451E-2   3.505E-2   1.27        5.340E-4   1.648E-4   3.24
  10    500    5.548E-2   4.783E-2   1.16        5.748E-4   2.475E-4   2.32

$W_{\overline{H}}$:

               Mean                              Variance
  m     n      Emp        Pred       Emp/Pred    Emp        Pred       Emp/Pred
  1     1      3.241E-6   3.462E-6   .936        1.875E-8   2.251E-8   .833
  1     100    3.068E-4   3.462E-4   .886        1.974E-6   2.251E-6   .877
  1     500    1.634E-3   1.731E-3   .944        1.1163E-5  1.126E-5   .992
  5     5      8.913E-5   8.656E-5   1.03        6.4808E-7  5.616E-7   1.15
  10    10     3.495E-4   3.462E-4   1.01        2.400E-6   2.240E-6   1.07
  10    100    3.508E-3   3.462E-3   1.01        2.328E-5   2.240E-5   1.04
  10    500    1.629E-2   1.731E-2   .941        1.077E-4   1.120E-4   .961

Table 3.1: A table of predicted versus empirical means and variances of the distribution $W_H$ in the top table, and $W_{\overline{H}}$ in the bottom table, for different values of m and n.


and therefore the weight that an (n + 3)-size random image contributes to an incorrectly hypothesized model of size m + 3 follows the distribution:

$$W_{\overline{H}} \sim N\left(n\,E[V_{\overline{M}}],\ n\,\mathrm{Var}(V_{\overline{M}})\right)$$
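As a rough numerical check of this distribution, the sketch below computes its mean and variance from the single-disk moments. It is an illustration under stated assumptions: the function name is hypothetical, and the single-disk moments are not taken from the closed forms of Section 3.8.2 but back-derived from the predicted m = 1, n = 1 row of the $W_{\overline{H}}$ part of Table 3.1.

```python
def incorrect_hypothesis_moments(m, n, ev_one_disk, ev2_one_disk):
    """Mean and variance of the weight n random points give an incorrect hypothesis.

    ev_one_disk and ev2_one_disk are the first and second moments of the
    weight of one random point with a single weight disk (m = 1).
    """
    ev_point = m * ev_one_disk                       # one point, m disjoint disks
    var_point = m * ev2_one_disk - ev_point ** 2     # E[V^2] - E[V]^2
    return n * ev_point, n * var_point               # n independent points


# Assumed inputs: predicted mean and variance for m = 1, n = 1 in Table 3.1.
EV1 = 3.462e-6
EV2 = 2.251e-8 + EV1 ** 2   # second moment = variance + mean squared

mean, var = incorrect_hypothesis_moments(10, 500, EV1, EV2)
print(mean, var)
```

With these inputs the m = 10, n = 500 case gives a mean of about 1.73e-2 and a variance of about 1.12e-4, in line with the predicted entries of the table.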

The means for both distributions were tested empirically from the same experiment as shown in Figures 3-5. That is, for $W_H$, we generated a random model of size m + 3 and projected it into an image, adding a Gaussian displacement error to each point, and adding n - m additional clutter points (distributed uniformly within the image). We only tested correct hypotheses, and kept track of the accumulated weight. We repeated this experiment for a given (m, n) pair until we had a few thousand points. The same was done for $W_{\overline{H}}$, except that the image tested contained n random points (i.e., the model was not projected into the image), implying that only incorrect hypotheses were tested. A table of values for the means and variances of all the experiments is given in Table 3.1. For all the experiments, occlusion = 0 and $\sigma_0 = 2.5$.

In all the experiments, the means are close to those predicted. For $W_{\overline{H}}$ the predicted variance is also quite accurate. The underestimate of the variance for $W_H$ is due to the fact that our assumption that the true points contribute weight independently of any other true point is false; in fact the average covariance between pairs of projected error distributions is positive. This can be seen also in Figure 3-6, which shows the empirical versus analytical density of $W_H$ for $(m, n, c, \sigma_0) = (10, 10, 0.0, 2.5)$.

Figure 3-6: Comparison of predicted to empirical density of $W_H$, for m = 10, n = 10, c = 0, and $\sigma_0 = 2.5$. Note that the empirical density has much greater variance than predicted.


Chapter  Summary 

In  the  analysis,  we  limited  the  model  domain  to  planar  objects.  The  reason  for  this 
was  that  the  analytic  expression  for  the  projected  error  of  the  fourth  model  point, 
which for planar objects is given by Equations 3.4 and 3.7, is not known for either
the  uniform  or  Gaussian  error  model  when  the  model  is  not  planar.  This  is  the  only 
factor  limiting  the  applicability  of  the  analysis  to  3D  models;  when  such  an  expression 
becomes  available,  this  method  will  easily  be  able  to  incorporate  it. 

In the beginning of the chapter we asked the questions, how do we accumulate evidence
for a hypothesis, how do we decide if the hypothesis is correct, and how likely
are  we  to  have  made  a  mistake  in  the  decision?  So  far  we  have  addressed  the  first 
question  by  selecting  the  recognition  algorithm  and  noise  model  that  we  are  using, 
and  deriving  the  probability  density  functions  associated  with  the  scores  that  correct 
and  incorrect  hypotheses  will  accumulate  using  our  algorithm.  In  the  next  chapter 
we  will  discuss  how  to  decide  whether  a  hypothesis  is  correct  or  not,  given  its  score. 
Chapter 4

Distinguishing Correct
Hypotheses


In the last chapter we selected an algorithm, noise model and weighting scheme, and
using these we derived expressions for $f_{W_H}$ and $f_{W_{\overline{H}}}$, the weight densities of correct
and  incorrect  hypotheses.  What  we  need  now  is  a  way  to  decide,  given  a  score  for  a 
hypothesis,  if  that  score  is  high  enough  to  warrant  our  deciding  that  the  hypothesis 
is  correct.  In  this  chapter  we  will  show  how  to  use  the  probability  densities  derived 
in  the  previous  chapter  to  do  this.  We  briefly  introduce  the  ROC  (receiver  operating 
characteristic)  curve,  a  concept  borrowed  from  standard  hypothesis  testing  theory, 
and  cast  our  problem  in  terms  of  this  framework. 


4.1  ROC:  Introduction 

Suppose we have the following problem: we are observing a world in which there are exactly two mutually exclusive and exhaustive events, $H_0$ and $H_1$. We are given the task of deciding which one of them is correct. The only hint we have is some quantity X that we can observe. We also know that if $H_0$ were true, then we would observe the value of X distributed in some known way, and similarly if $H_1$ were true, i.e., we know $f_X(x \mid H_0)$ and $f_X(x \mid H_1)$. To relate this back to the object recognition problem, $H_0$ = “hypothesis being tested is incorrect” and $H_1$ = “hypothesis being tested is correct”.

Let the space of all possible values of the random variable X be divided into two regions, $Z_0$ and $Z_1$, such that we decide $H_0$ if the value of X falls in $Z_0$, and $H_1$ if X falls in $Z_1$. Conversely, we can think of the decision procedure as defining the decision regions $Z_0$ and $Z_1$. Then we can define the quantities

$$P\{\text{say } H_0 \mid H_0 \text{ is true}\} = \int_{Z_0} f_X(x \mid H_0)\, dx$$

$$P_F = P\{\text{say } H_1 \mid H_0 \text{ is true}\} = \int_{Z_1} f_X(x \mid H_0)\, dx$$

$$P_M = P\{\text{say } H_0 \mid H_1 \text{ is true}\} = \int_{Z_0} f_X(x \mid H_1)\, dx$$

$$P_D = P\{\text{say } H_1 \mid H_1 \text{ is true}\} = \int_{Z_1} f_X(x \mid H_1)\, dx$$

These quantities are often referred to as $P_M$ = “Probability of a miss”, $P_D$ = “Probability of detection”, and $P_F$ = “Probability of false alarm” for historical reasons.

In our problem we are assuming we have no prior knowledge of the probabilities of $H_0$ or $H_1$. In the absence of such information, a Neyman Pearson criterion, which maximizes $P_D$ for a given $P_F$, is considered optimal [VT68]. This criterion uses a likelihood ratio test (LRT) to divide the observation space into decision regions, i.e.,

$$\frac{f_X(x \mid H_1)}{f_X(x \mid H_0)} \underset{\text{say } H_0}{\overset{\text{say } H_1}{\gtrless}} \eta$$

That is, we observe a particular value x, and compare the conditional probability density functions for that value of x. If the ratio of the conditional densities is greater than a fixed threshold $\eta$, choose $H_1$, otherwise choose $H_0$. Changing the value of the threshold changes the decision regions and thus the values of $P_F$ and $P_D$. The ROC (receiver operating characteristic) curve is simply the graph of $P_D$ versus $P_F$ as the threshold for the LRT is varied. The optimal performance achievable is given by points on the curve.

If the prior probabilities of $H_0$ and $H_1$ are known, then the optimal Bayes decision rule is used. This test also involves a likelihood ratio test, in which the threshold $\eta$ is chosen to minimize the expected cost of the decision, and is a function of the costs and priors involved:

$$\eta = \frac{(C_{10} - C_{00})\,P_0}{(C_{01} - C_{11})\,P_1}$$

where $C_{ij}$ is the cost associated with choosing hypothesis i given that hypothesis j is correct, $P_i$ is the a priori probability that hypothesis $H_i$ is correct, and $C_{10} > C_{00}$ and $C_{01} > C_{11}$ have been assumed. This point necessarily lies on the ROC curve; thus the ROC curve encapsulates all information needed for either the Neyman Pearson or Bayes criterion.

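For example, assume for our problem that $H_0 \sim N(m_0, \sigma_0^2)$ and $H_1 \sim N(m_1, \sigma_1^2)$, and assume that $m_1 > m_0$ and $\sigma_1 > \sigma_0$. The likelihood ratio test yields:

$$\frac{f_X(x \mid H_1)}{f_X(x \mid H_0)} \underset{\text{say } H_0}{\overset{\text{say } H_1}{\gtrless}} \eta$$

$$\frac{\sigma_0}{\sigma_1} \exp\left(\frac{(x - m_0)^2}{2\sigma_0^2} - \frac{(x - m_1)^2}{2\sigma_1^2}\right) \underset{\text{say } H_0}{\overset{\text{say } H_1}{\gtrless}} \eta$$

$$\frac{(x - m_0)^2}{2\sigma_0^2} - \frac{(x - m_1)^2}{2\sigma_1^2} \underset{\text{say } H_0}{\overset{\text{say } H_1}{\gtrless}} \ln\left(\frac{\eta\,\sigma_1}{\sigma_0}\right)$$

$$\left(\frac{x - m_0}{\sigma_0}\right)^2 - \left(\frac{x - m_1}{\sigma_1}\right)^2 \underset{\text{say } H_0}{\overset{\text{say } H_1}{\gtrless}} 2\ln\left(\frac{\eta\,\sigma_1}{\sigma_0}\right) \triangleq \gamma$$

The regions $Z_0$ and $Z_1$ are found by solving the above equation for equality,

$$x_1 = \frac{(m_0\sigma_1^2 - m_1\sigma_0^2) - \sigma_0\sigma_1\sqrt{\gamma[\sigma_1^2 - \sigma_0^2] + (m_0 - m_1)^2}}{\sigma_1^2 - \sigma_0^2}$$

$$x_2 = \frac{(m_0\sigma_1^2 - m_1\sigma_0^2) + \sigma_0\sigma_1\sqrt{\gamma[\sigma_1^2 - \sigma_0^2] + (m_0 - m_1)^2}}{\sigma_1^2 - \sigma_0^2}$$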


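The values of $P_F$ and $P_D$ are found by integrating the conditional probability densities $f_X(x \mid H_0)$ and $f_X(x \mid H_1)$ over these regions $Z_0$ and $Z_1$, where $Z_0 = \{x : x_1 < x < x_2\}$ and $Z_1 = \overline{Z_0}$.

$$P_F = \int_{Z_1} f_X(x \mid H_0)\, dx = 1 - \int_{x_1}^{x_2} \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\left(-\frac{(x - m_0)^2}{2\sigma_0^2}\right) dx$$

$$P_D = \int_{Z_1} f_X(x \mid H_1)\, dx = 1 - \int_{x_1}^{x_2} \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left(-\frac{(x - m_1)^2}{2\sigma_1^2}\right) dx$$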
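The two-Gaussian example can be checked numerically. The sketch below (function names are hypothetical, and it uses only the standard library) solves $((x - m_0)/\sigma_0)^2 - ((x - m_1)/\sigma_1)^2 = \gamma$ for the two boundaries and integrates each density outside them, using the parameters quoted in the caption of Figure 4-1: $N(1, 0.25)$, $N(3, 1)$ and $\gamma = -1.76$.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of N(mean, std^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def lrt_boundaries(m0, s0, m1, s1, gamma):
    """Roots x1 < x2 of ((x-m0)/s0)^2 - ((x-m1)/s1)^2 = gamma, assuming s1 > s0."""
    d = s1 ** 2 - s0 ** 2
    centre = m0 * s1 ** 2 - m1 * s0 ** 2
    root = s0 * s1 * math.sqrt(gamma * d + (m0 - m1) ** 2)
    return (centre - root) / d, (centre + root) / d


def roc_point(m0, s0, m1, s1, gamma):
    """Return (x1, x2, P_F, P_D); H1 is chosen outside the interval (x1, x2)."""
    x1, x2 = lrt_boundaries(m0, s0, m1, s1, gamma)
    p_f = 1.0 - (normal_cdf(x2, m0, s0) - normal_cdf(x1, m0, s0))
    p_d = 1.0 - (normal_cdf(x2, m1, s1) - normal_cdf(x1, m1, s1))
    return x1, x2, p_f, p_d


# Standard deviations are 0.5 and 1.0 (variances 0.25 and 1).
x1, x2, p_f, p_d = roc_point(1.0, 0.5, 3.0, 1.0, -1.76)
print(round(x1, 2), round(x2, 2), round(p_f, 2), round(p_d, 2))
```

The boundaries and the ROC point obtained this way agree with the values reported in the caption of Figure 4-1 to two decimal places.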


In Figure 4-1, for example, we have plotted the ROC curve for the distributions $f_X(x \mid H_0)$ and $f_X(x \mid H_1)$ alongside. The axes are $x = P_F$, $y = P_D$. The line x = y is a lower bound, since for a point on this line, any decision is as likely to be true as false, so the observed value of X gives us no information. Though an ROC curve is a 3D entity (i.e., a point in $(P_F, P_D, \eta)$ space), we display its projection onto the $\eta = 0$ plane and can easily find the associated $\eta$ value for any $(P_F, P_D)$ pair. When the threshold is infinite there is a 0 probability of a false positive, but a 0 probability of correct identification as well. As the threshold goes down, the probabilities of both occurrences go up until the threshold is 0, when both correct and false identification are certain. In our problem we assume that we do not have priors, so our goal is to pick a threshold such that we have a very high probability of identification and a low probability of false positives, i.e., we are interested in picking a point as close to the upper left hand corner as possible. Note that the larger the separation between the two hypothesis distributions, the more the curve is pushed towards that direction.

Figure 4-1: On the left are displayed the conditional probability density functions $f_X(x \mid H_0) \sim N(1, 0.25)$ and $f_X(x \mid H_1) \sim N(3, 1)$ of a random variable X. On the right is the associated ROC curve, where $P_F$ and $P_D$ correspond to the x and y axes, respectively. On the left graph, the boundaries $x_1 = -0.76$ and $x_2 = 1.42$ implied by the value $\gamma = -1.76$ are indicated by boxes. On the right, the ROC point for this $\gamma$ value is shown. The $P_F$ and $P_D$ values are obtained by integrating the area under the curves $f_X(x \mid H_0)$ and $f_X(x \mid H_1)$ respectively, outside the boundaries. The integration yields the ROC point (0.2, 0.94).


4.2  Applying  the  ROC  to  Object  Recognition 

In our problem formulation, $H_0$ is the event that the hypothesis is not correct, and $H_1$ the event that it is. In our case, we have a different ROC curve associated with every fixed (m, n) pair, where m + 3 and n + 3 are the number of model and image features, respectively. We assume that $H_0$ and $H_1$ have Gaussian densities $f_{W_{\overline{H}}}$ and $f_{W_H}$, whose means and variances were derived in Chapter 3. Because in our formulation the variance of $f_{W_H}$ is always greater than that of $f_{W_{\overline{H}}}$, the lower bound of the interval defining $Z_0$ is always negative. Since we can’t in practice achieve any score lower than 0, we will treat the test as a threshold test; that is, we will accept a hypothesis as being correct if its score falls above $\theta = x_2$.

Using  this  technique,  we  can  predict  thresholds  for  simulated  experiments,  as  shown 
in  the  next  section. 
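Under the Gaussian assumption, the threshold test reduces to two tail probabilities. The following is a small sketch of how predicted ROC points could be tabulated; the function names and the particular thresholds are illustrative assumptions, and the means and variances are the predicted m = 10, n = 10 entries of Table 3.1.

```python
import math


def tail(theta, mean, var):
    """P(W > theta) for W ~ N(mean, var)."""
    return 0.5 * math.erfc((theta - mean) / math.sqrt(2.0 * var))


def predicted_roc(mean_h, var_h, mean_hbar, var_hbar, thresholds):
    """List of (theta, P_F, P_D) for a threshold test on Gaussian scores."""
    return [(t, tail(t, mean_hbar, var_hbar), tail(t, mean_h, var_h))
            for t in thresholds]


# Predicted moments for m = 10, n = 10 (Table 3.1): correct, then incorrect.
points = predicted_roc(3.218e-2, 1.462e-4, 3.462e-4, 2.240e-6,
                       [0.0, 0.005, 0.01, 0.02, 0.03])
for theta, p_f, p_d in points:
    print(theta, p_f, p_d)
```

Raising the threshold lowers both probabilities, so the points trace the ROC curve from the upper right toward the lower left.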


4.3  Experiment 

The predictions of the previous section were tested in the following experiment: to test an ROC curve for model size m + 3, image size n + 3, occlusion c and sensor error $\sigma_0$, we run two sets of trials, one to test the probability of detection and one to test the probability of false alarm. In all our experiments we used a value $\sigma_0 = 2.5$. For $P_D$, a random model of size m + 3 consisting of point features was generated and projected into an image, with Gaussian noise added to the x and y positional components of each point feature, independently. Occlusion (c) is simulated by giving each projected model point a probability c of not appearing in the resulting image. Only correct correspondences are tested, and the weight of each of these correct hypotheses is found using the algorithm, restated here:

(a) for a hypothesis $(m_0, m_1, m_2) : (s_0, s_1, s_2)$, find all Gaussian weight disk locations and sizes:

(i) find affine coordinates $m_j = (\alpha_j, \beta_j)$ with respect to basis $(m_0, m_1, m_2)$

(ii) projected image location for $m_j$ is $s_0 + \alpha_j(s_1 - s_0) + \beta_j(s_2 - s_0)$

(iii) projected Gaussian weight disk radius for $m_j$ is $2\sigma = 2\sigma_0(\alpha_j^2 + \beta_j^2 + [1 - \alpha_j - \beta_j]^2 + 1)^{1/2}$

(b) for every image point $s_i$, initially set $d = \infty$.

(i) find the minimum distance d between $s_i$ and $m_j$ such that $d \le 2\sigma$.

(ii) add $v = \frac{1}{2\pi\sigma^2} e^{-d^2/2\sigma^2}$ (the height of the Gaussian weight disk at the image point location) to the sum w, which is the total score for this hypothesis. If this image point did not come within $2\sigma$ of any projected model point, then v = 0.
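The steps above can be sketched in a few lines of Python. This is an illustrative reading of the algorithm rather than the original implementation: the function names are hypothetical, points are plain (x, y) tuples, and the disk standard deviation $\sigma = \sigma_0 (\alpha^2 + \beta^2 + (1 - \alpha - \beta)^2 + 1)^{1/2}$ is an assumed form, since that expression is partly illegible in the source.

```python
import math


def affine_coords(p, basis):
    """(alpha, beta) such that p = b0 + alpha*(b1 - b0) + beta*(b2 - b0)."""
    (x0, y0), (x1, y1), (x2, y2) = basis
    ux, uy = x1 - x0, y1 - y0
    vx, vy = x2 - x0, y2 - y0
    det = ux * vy - uy * vx
    px, py = p[0] - x0, p[1] - y0
    return (px * vy - py * vx) / det, (ux * py - uy * px) / det


def hypothesis_score(model_basis, image_basis, model_points, image_points, sigma0):
    """Total Gaussian weight collected by one hypothesized basis match."""
    s0, s1, s2 = image_basis
    disks = []
    for p in model_points:                                  # step (a)
        a, b = affine_coords(p, model_basis)
        cx = s0[0] + a * (s1[0] - s0[0]) + b * (s2[0] - s0[0])
        cy = s0[1] + a * (s1[1] - s0[1]) + b * (s2[1] - s0[1])
        # Assumed form of the projected standard deviation (see lead-in).
        sigma = sigma0 * math.sqrt(a * a + b * b + (1.0 - a - b) ** 2 + 1.0)
        disks.append((cx, cy, sigma))
    score = 0.0
    for x, y in image_points:                               # step (b)
        best = None
        for cx, cy, sigma in disks:
            d = math.hypot(x - cx, y - cy)
            if d <= 2.0 * sigma and (best is None or d < best[0]):
                best = (d, sigma)
        if best is not None:                                # otherwise v = 0
            d, sigma = best
            score += math.exp(-d * d / (2.0 * sigma * sigma)) / (2.0 * math.pi * sigma * sigma)
    return score


basis = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]
score = hypothesis_score(basis, basis, [(50.0, 50.0)],
                         [(50.0, 50.0), (400.0, 400.0)], 2.5)
print(score)
```

In the usage example the first image point sits exactly at the centre of the single weight disk and contributes its peak height, while the second lies far outside every disk and contributes nothing.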

We performed this experiment, keeping a histogram of the weights, until there were 2500 sample points. To test the probability of false alarm, we ran the same experiment using random images which do not contain the model we are looking for. The resultant histograms are normalized to yield the empirical density of $W_H$ and $W_{\overline{H}}$ for the given values of m, n, c and $\sigma_0$. To construct the ROC curves we loop through 25 thresholds and tally the proportion of the empirical distributions of $W_H$ and $W_{\overline{H}}$ that fall above the threshold, yielding a $(P_F, P_D)$ pair for each one. The resulting $P_F$ and $P_D$ curves as a function of threshold $\theta$ are shown in Figure 4-2 for n = 10, 100, 500, 500 and occlusion c = 0.0, 0.0, 0.0, 0.25. The ROC curves for the same parameters are shown alongside. The axes for the graphs are $(x, y) = (\theta, P_F)$, $(x, y) = (\theta, P_D)$, and $(x, y) = (P_F, P_D)$.
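The threshold sweep can be written directly from two lists of accumulated weights. In this sketch the function name is hypothetical and the thresholds are spaced evenly between 0 and the largest observed score, which is an assumption; the text states only that 25 thresholds were used.

```python
def empirical_roc(correct_scores, incorrect_scores, num_thresholds=25):
    """Return (theta, P_F, P_D) triples from samples of W_H and W_H-bar."""
    top = max(max(correct_scores), max(incorrect_scores))
    points = []
    for i in range(num_thresholds):
        theta = top * i / (num_thresholds - 1)   # evenly spaced (assumed)
        p_f = sum(w > theta for w in incorrect_scores) / len(incorrect_scores)
        p_d = sum(w > theta for w in correct_scores) / len(correct_scores)
        points.append((theta, p_f, p_d))
    return points


# Toy scores, chosen only to make the tallies easy to follow.
roc = empirical_roc([3.0, 4.0, 5.0, 2.0], [0.0, 1.0, 0.0, 3.0], num_thresholds=6)
for theta, p_f, p_d in roc:
    print(theta, p_f, p_d)
```

Each triple gives one point of the empirical $P_F$ curve, one of the empirical $P_D$ curve, and, taken as a $(P_F, P_D)$ pair, one point of the empirical ROC curve.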

The graphs of the $P_F$, $P_D$ and ROC curves indicate that the predicted and actual curves match very well, with the best predictions when the number of clutter points is high. Turning to the $P_D$ plots, we see that when the threshold is high we consistently underpredict the probability of detection. This error works in our favor, since it pushes the actual $(P_F, P_D)$ points up above the predicted ones. This high threshold area corresponds to the region on the ROC curve along the $P_D$ axis.

The discrepancies between the curves are due to assumptions we made in the analytic derivations, the most significant of which is the assumption that $W_H$ and $W_{\overline{H}}$ are Gaussian. In fact, none of the displayed empirical curves are actually Gaussian, though when the clutter is high the distributions are more nearly so. In theory we could use Chernoff bounds to bound the expressions for $P_D$ and $P_F$ for a given threshold [VT68], but we will not explore this option. Instead, we will use the analytical curves as an approximation to the actual curves, and note that despite this modelling error, we still see a good fit between predicted and empirical performance.


4.3.1  Using  Model- Specific  ROC  Curves 

The largest discrepancy between predicted and actual performance can be traced to the $P_D$ prediction, as seen in Figure 4-2. In our analysis, we assume two things that cause this mismatch. First, we assume that a correctly hypothesized model point accumulates weight in favor of a correct hypothesis independently of any other. Second, we assume that the number of model points is large enough so that we can approximate the probability density of $W_H$ by a Gaussian. However, we have shown in the previous section how to empirically derive the actual $f_{W_H}$ and $P_D$ curves by simulation. We can use this technique to tailor the overall method to a particular model in order to improve the prediction for that model.

Figure 4-2: Comparison of predicted to empirical curves for probability of false alarm, probability of detection, and ROC curves. The empirical curves are indicated by boxes. The axes for the graphs, from left to right, are $(x, y) = (\theta, P_F)$, $(\theta, P_D)$, and $(P_F, P_D)$. For all graphs, m = 10, $\sigma_0 = 2.5$. From top to bottom, n = 10, 100, 500, 500, occlusion = 0, 0, 0, 0.25.

Specifically, the method would work as follows: the same simulation as was performed in the previous section is done, using the given model instead of a randomly generated one, with no occlusion or clutter. A simple function can be fit to the actual distribution for $W_H$, and this function will subsequently be used as the density of $W_H$ (with no clutter or occlusion) for this model. The density of $W_H$ for any other value of n and c can then easily be derived from this.


Summary 

In  this  chapter  we  introduced  the  ROC  curve,  which  enables  us  to  encapsulate  all 
the  information  needed  to  make  a  decision  about  choosing  thresholds  to  determine 
performance.  That  is,  for  a  particular  image,  model,  and  threshold  for  the  weight  that 
a  hypothesized  match  must  score  in  order  to  accept  it,  we  can  predict  the  probability 
that  a  correct  or  incorrect  match  will  pass  the  threshold.  Conversely,  for  a  given  model 
and image, we can predict the threshold required to achieve a given probability of
true detection or false alarm. We applied this technique to simulated models and
images  and  were  able  to  successfully  predict  thresholds  and  performance  for  a  wide 
range  of  model  to  image  sizes. 

The  ROC  curve  also  indicates  the  level  of  performance  achievable  for  a  particular 
model  and  image,  so  that  we  can  determine  when  a  desired  level  of  performance 
(for  instance,  0  probability  of  false  alarm  at  the  same  time  as  a  1.0  probability  of  a 
true  detection)  is  simply  not  possible  for  a  given  model  and  image.  In  effect,  we  are 
able  to  identify  when  an  image  is  simply  too  noisy  to  be  able  to  achieve  any  better 
performance  than  randomly  guessing  whether  a  given  hypothesis  is  correct  or  not. 
Chapter 5

Comparison  of  Weighting  Schemes 


In  the  previous  chapters  we  talked  about  decision  making  for  a  particular  weighting 
scheme;  however,  we  can  use  the  machinery  we  developed  in  the  last  chapter  not  only 
to  evaluate  hypotheses,  but  also  to  compare  the  relative  merits  of  different  possible 
weighting  schemes.  In  this  chapter  we  use  the  ROC  to  compare  several  such  schemes. 
We  have  already  discussed  one  weighting  scheme  which  we  will  call  Scheme  1.  Scheme 
2  will  denote  the  weighting  and  decision  scheme  generally  used  with  the  uniform 
bounded  error  model,  and  Scheme  3  will  denote  the  same  weight  disk  as  Scheme  1, 
but  using  a  weight  accumulation  algorithm  which  collects  evidence  from  at  most  one 
point  per  projected  error  disk.  Ultimately  we  will  decide  to  remain  with  the  original 
scheme  we  developed  in  Chapter  3. 

5.1  Uniform  Weighting  Scheme 

The weight disk used with the uniform bounded error model assigns a full vote to any model point which falls inside it. For a model point with coordinates $(\alpha, \beta)$ in the frame established by the model basis used in the correspondence, the projected weight disk has radius

$$\epsilon_0(|1 - \alpha - \beta| + |\alpha| + |\beta| + 1),$$

where $\epsilon_0$ is the radius of the uniform error distribution for sensor noise. We will use
the  symbol  Cg  to  describe  the  values  this  expression  takes  on  as  the  affine  coordinates 
vary. The expected value of $V_M$ under this scheme is 1 - c, and $E[V_{\overline{M}} \mid m = 1]$ is the probability that a random point will contribute a vote of 1 at a particular weight disk to an incorrect hypothesis. This is the expected size of the weight disk over the size of the image. This last expression was called the redundancy factor $\mu$ and was derived analytically in [GHJ91], but for our comparison we took the empirical value from simulations such as those described in Section 3.6.1. For an $\epsilon : \sqrt{A}$ ratio of 1 : 100, $\mu \approx 0.0034$. The expected value of $V_{\overline{M}}$ is the probability that a single random point dropping into an image with m circles of average area $\mu$ will fall into one or more of them. Using the notation

1 0  otherwise 

the  expression  describing  the  probability  that  a  clutter  point  will  drop  into  one  or 
more  weight  disks  is  given  by  the  inclusion-exclusion  principle,  and  is 

m 

i=l 

If we upper bound this probability by assuming the weight disks are disjoint, this is simply $m\mu$. This approximation constrains the number of model points to be less than $\mu^{-1} \approx 300$.

The random variable $W_{\overline{H}}$ is binomially distributed:

$$F_{W_{\overline{H}}}(k) = B(m\mu;\, n, k)$$

where $B(p;\, n, k)$ denotes the binomial probability of k successes in n trials with success probability p. The distribution for $W_H$ is a little more complicated; in order to observe exactly k points, i points must have been observed from the model, and the remaining k - i points were random, for all i from 0 to k:

$$F_{W_H}(k) = \sum_{i=0}^{k} B(1 - c;\, m, i)\, B(m\mu;\, n, k - i)$$

This convolution of two binomial distributions is not itself binomial, and the optimal Neyman Pearson test to distinguish between them is complicated to derive. We will use a simple threshold test since it is widely used, though we have not proven that it is optimal with respect to the Neyman Pearson criterion. The probabilities of a true and false positive using a threshold test are, respectively

Pd  = 

1=0 

Pf  = 

«=o 
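The vote distributions of the uniform scheme are simple enough to tabulate exactly. The sketch below is an illustration with hypothetical function names; it takes the redundancy factor as a plain number (0.0034, the value quoted in the text for a 1 : 100 ratio), and it assumes a hypothesis is accepted when its vote count is at least the threshold, since the exact form of the threshold sums is not legible in the source.

```python
from math import comb


def binom_pmf(p, n, k):
    """Probability of k successes in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def incorrect_pmf(m, n, redundancy):
    """Vote distribution of an incorrect hypothesis: each of n random points
    lands in one of m disjoint disks with probability m * redundancy."""
    return [binom_pmf(m * redundancy, n, k) for k in range(n + 1)]


def correct_pmf(m, n, c, redundancy):
    """Vote distribution of a correct hypothesis: convolution of the true-point
    votes (each of m points is visible with probability 1 - c) and clutter votes."""
    true_part = [binom_pmf(1.0 - c, m, i) for i in range(m + 1)]
    clutter = incorrect_pmf(m, n, redundancy)
    out = [0.0] * (m + n + 1)
    for i, p_i in enumerate(true_part):
        for j, p_j in enumerate(clutter):
            out[i + j] += p_i * p_j
    return out


def threshold_roc(m, n, c, redundancy, theta):
    """(P_F, P_D) when a hypothesis is accepted if its votes are >= theta (assumed)."""
    p_f = sum(incorrect_pmf(m, n, redundancy)[theta:])
    p_d = sum(correct_pmf(m, n, c, redundancy)[theta:])
    return p_f, p_d


p_f, p_d = threshold_roc(10, 10, 0.0, 0.0034, 1)
print(p_f, p_d)
```

With no occlusion every model point votes, so $P_D$ stays at 1 for any threshold up to m, which is the behaviour the comparison with the Gaussian scheme relies on.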


Figure 5-1 compares the ROC curves for Scheme 1 (Gaussian weight disk) and Scheme 2 (uniform weight disk) for m = 10, n = 10, 50, 100, 500, 1000, and occlusion = 0.0 and 0.25. We can see that in the case of no occlusion and for small values of n, both techniques predict good $P_F$ vs $P_D$ curves, though the bounded uniform weight disk has better performance because there is no possibility of a false negative when occlusion = 0, while with the Gaussian weight disk there always is. However, as n increases, the performance of Scheme 2 breaks down more rapidly than Scheme 1 for both occlusion values. For occlusion = 0.25, both schemes perform about equally for small values of n (for example, at n = 100), but again as n increases, the performance of Scheme 2 degrades more dramatically than that of Scheme 1 (n > 500).

5.2  An  Alternative  Accumulation  Procedure 

In Section 3.8 we presented the basic recognition algorithm, and pointed out that in the accumulation procedure, we add the contributions from every image point, as opposed to at most one point per error disk. That is, if several image points fall within the same error disk, we add the contributions from all of them. Intuitively one would expect that this would not work as well as simply taking the value of the closest point to the center of each error disk.

In this section we investigate what happens to the ROC curve if we modify the
accumulation step to take only the “heaviest” point per error disk, i.e., the one
appearing closest to the disk center. This weight scheme will be called Scheme 3, and
we  will  use  the  same  variable  names  as  we  did  for  Scheme  1,  but  with  a  in  the  name 
to  differentiate  the  random  variables  and  their  distributions  from  those  of  Scheme  1. 
The  derivations  of  the  density  functions  for  Scheme  3  are  more  difficult,  and  we  will 
end  up  approximating  the  density  function  fvs.  such  that  is  underestimated. 

Surprisingly,  we  will  see  that  even  with  this  underestimate,  the  theoretical  ROC  curve 
for  Scheme  3  is  not  as  good  as  for  Scheme  1. 

We begin by defining two new random variables, $V^+_M$ and $V^+_{\bar M}$. The difference between
$V_M$ and $V^+_M$ is that the former variable described the weight that a true image point
would yield, that is, a point which actually arises from the model when a correct
correspondence between model and image points has been established, and the rest
of the model points are projected into the image. $V^+_M$ is the weight that a true weight
disk will contribute to the accumulated sum, that is, a disk which is projected into
the image when a correct hypothesis is being tested, when the image contains n + 3
points. The same distinction holds for the variables $V_{\bar M}$ and $V^+_{\bar M}$.

We begin with the random variable $V^+_{\bar M}$. Extending a derivation given in [BRB89] for
the one dimensional case, we first define yet another random variable, $X_k$ = distance of
the closest point to the center of a disk, when the disk contains k uniformly distributed
points. We derive the probability density function as follows: Let a be the radius of
the disk. We divide the disk into an inner disk and 2 rings: an inner disk of radius x,
a ring of width h between radii x and x + h, and an outer ring between radii x + h and a.

Figure 5-1: Comparisons of uniform and Gaussian weight disks for m = 10, n = 10, 50,
100, 500, 1000. Left: uniform weight disk; right: Gaussian weight disk. Top: occlusion = 0;
bottom: occlusion = 0.25. For all ROC curves, the x and y axes are $P_F$ and $P_D$, respectively.
A low threshold results in an ROC point in the upper right corner. As the threshold
increases, the performance ($P_F$, $P_D$) moves along the curve toward the lower left corner.


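To find the probability density of $X_k$, we first find the probability that of k points,
none fall in the inner disk, one falls inside the h-ring and the remaining k - 1 fall
outside the ring:

$$P\{x < X_k < x + h\} = \binom{k}{1} \left(\frac{\pi (x+h)^2 - \pi x^2}{\pi a^2}\right) \left(\frac{\pi a^2 - \pi (x+h)^2}{\pi a^2}\right)^{k-1}$$

$$= \frac{k}{a^{2k}}\,(2hx + h^2)\,(a^2 - x^2 - 2hx - h^2)^{k-1}$$

Now we take the above expression, divide by the width of the ring h, and take the
limit as $h \to 0$:

$$f_{X_k}(x) = \lim_{h \to 0} \frac{P\{x < X_k < x + h\}}{h}$$

$$= \lim_{h \to 0} \frac{k}{a^{2k}}\,(2x + h)\,(a^2 - x^2 - 2hx - h^2)^{k-1} = \frac{2kx}{a^{2k}}\,(a^2 - x^2)^{k-1}$$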
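
As a sanity check on this density, the sketch below compares the closed form $f_{X_k}(x) = 2kx(a^2 - x^2)^{k-1}/a^{2k}$ with a small Monte Carlo simulation. The function names and the sampling approach (radius $a\sqrt{U}$ for a uniform point in a disk) are choices made here for illustration; the cumulative form $1 - (1 - x^2/a^2)^k$ is obtained by integrating the density.

```python
import math
import random


def closest_point_pdf(x, k, a):
    """Density of the distance from the disk center to the closest of k uniform points."""
    return 2.0 * k * x * (a * a - x * x) ** (k - 1) / a ** (2 * k)


def closest_point_cdf(x, k, a):
    """P{X_k <= x}: one minus the chance that all k points lie farther than x."""
    return 1.0 - (1.0 - (x / a) ** 2) ** k


def simulate_closest(k, a, trials, rng):
    """Draw k uniform points in a disk of radius a and record the smallest radius."""
    # A uniform point in a disk has radius a * sqrt(U) with U uniform on [0, 1).
    return [min(a * math.sqrt(rng.random()) for _ in range(k)) for _ in range(trials)]


if __name__ == "__main__":
    rng = random.Random(0)
    samples = simulate_closest(3, 2.0, 20000, rng)
    empirical = sum(s <= 1.0 for s in samples) / len(samples)
    print("empirical:", empirical, "predicted:", closest_point_cdf(1.0, 3, 2.0))
```

For k = 1 and a = 2σ the density reduces to x / (2σ²), the single-point case used later in this section.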


Now, let $Y_n$ = distance of closest point to the center of a given weight disk (i.e.,
$\sigma_e$ is fixed) in an image with n + 3 points. So, the radius of the disk, a, is $2\sigma_e$. For
legibility, let us set $\mu_e = 4\pi\sigma_e^2 / A$ to be the probability that a random image point falls
in the error disk. Then the probability that the closest image point to the disk center
is at a distance x equals the probability that exactly k points fall in the disk times
the probability that the closest of the k points is x away from the center of the disk,
summed over all k:

$$f_{Y_n|\sigma_e}(x \mid \sigma_e) = \sum_{k=1}^{n} P\{k \text{ pts fall in disk}\}\; P\{n - k \text{ pts fall outside}\}\; f_{X_k|\sigma_e}(x \mid \sigma_e)$$

$$= \sum_{k=1}^{n} \binom{n}{k}\, \mu_e^k\, (1 - \mu_e)^{n-k}\, f_{X_k|\sigma_e}(x \mid \sigma_e)$$

To derive $f_{V^+_{\bar M}|\sigma_e}(v \mid \sigma_e)$, we have to determine the density of $g(Y_n)$ for a fixed $\sigma_e$, where
g is the Gaussian weighting function of Chapter 3. This is extraordinarily complicated,
and we will not even attempt it.

Instead, we do the following. First let us assume that n is less than a fixed upper
bound N on the number of image points (defined below). We will justify this
assumption shortly. This together with the fact that $(1 - \mu_e) \approx 1$ means that not
only is the binomial term decreasing after the first term, but that the second term
is less than half the first. Therefore, we take the liberty of approximating the entire
distribution by the first term. When we do this, the term $f_{X_k|\sigma_e}$ becomes much
simpler, since we only have to worry about the case where a single random image
point falls in the error disk, k = 1:

$$f_{X_1|\sigma_e}(x \mid \sigma_e) = \frac{x}{2\sigma_e^2}$$
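
The claim about the binomial terms can be checked numerically. The sketch below is illustrative only: it assumes the bound on n amounts to $n\mu_e < 1$ (an assumption made here, since the ratio of consecutive terms is $(n-k)\mu_e / ((k+1)(1-\mu_e))$, which is below 1/2 for every $k \geq 1$ under that condition), and the function name and sample values are hypothetical.

```python
from math import comb


def binomial_terms(n, mu, kmax):
    """Terms C(n, k) * mu^k * (1 - mu)^(n - k) for k = 1 .. kmax."""
    return [comb(n, k) * mu**k * (1.0 - mu) ** (n - k) for k in range(1, kmax + 1)]


if __name__ == "__main__":
    # Hypothetical values: 100 clutter points, each landing in the disk with prob. 0.005.
    terms = binomial_terms(100, 0.005, 5)
    for k, t in enumerate(terms, start=1):
        print(k, t)
    print("second / first:", terms[1] / terms[0])
```

Keeping only the k = 1 term therefore discards a tail that shrinks at least geometrically, which is why the resulting density underestimates the true one.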


Not surprisingly, this is the same expression as we derived back in Chapter 3 for the
distance of a single point from the center of a disk, when the point is drawn from
a uniform distribution. At that time we derived the density $f_{g(X_1)|\sigma_e}(v \mid \sigma_e)$
of the weight that such a point will contribute when using our weighting scheme g.

So, combining this expression with the probability that a single point will fall in a
given error disk, we get

$$(n\mu_e)\, f_{g(X_1)|\sigma_e}(v \mid \sigma_e)$$

which is exactly n times the distribution $f_{V_{\bar M}}(v \mid m = 1)$ that we derived back in
Chapter 3. Without rehashing all the steps, we simply point out a few differences. In
particular, the upper bound on m in Equation 3.15 was needed in order
for the distribution $f_{V_{\bar M}}(v)$ to be a density function. For the distribution $f_{V^+_{\bar M}}(v)$ the
same bound must hold, but for n instead of m. Let us call N the maximum number
of allowable image points. Now we can justify our first assumption that n < N: Let
us use the expected area of the Gaussian error disk over all values of $\sigma_e$. This is given
by the expression:

$$\int \pi (2\sigma_e)^2\, f_{\sigma_e}(\sigma_e)\, d\sigma_e$$

The maximum number of disks that can fit into the image (assuming the disks are
disjoint) is the image area A divided by this expression, which is exactly the bound
N that we assumed above.

To sum up, we have derived the approximation

$$f_{V^+_{\bar M}}(v) \approx \min(n, N)\, f_{V_{\bar M}}(v \mid m = 1)$$

and therefore

$$E[V^+_{\bar M}] \approx \min(n, N)\, E[V_{\bar M} \mid m = 1]$$

$$E[(V^+_{\bar M})^2] \approx \min(n, N)\, E[V_{\bar M}^2 \mid m = 1]$$

$$\mathrm{Var}(V^+_{\bar M}) \approx \min(n, N)\, E[V_{\bar M}^2 \mid m = 1] - \min(n, N)^2\, E[V_{\bar M} \mid m = 1]^2$$

All the terms on the right hand side of these equations are known quantities that were
derived back in Chapter 3. Note that because of our approximations, our prediction
for $E[V^+_{\bar M}]$ is an underestimate of the distribution's actual first moment.
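
These three approximations translate directly into a small helper. The sketch below is illustrative: the function and argument names are hypothetical, and the single-point moments $E[V_{\bar M} \mid m = 1]$ and $E[V_{\bar M}^2 \mid m = 1]$ are taken as given inputs (the Chapter 3 expressions are not reproduced here).

```python
def scheme3_clutter_moments(n, n_max, mean_one, second_moment_one):
    """Approximate mean, second moment and variance of the Scheme 3 clutter weight.

    n                 -- number of clutter points in the image
    n_max             -- the bound N on the number of allowable image points
    mean_one          -- E[V | m = 1] from the Scheme 1 analysis (assumed known)
    second_moment_one -- E[V^2 | m = 1] from the Scheme 1 analysis (assumed known)
    """
    scale = min(n, n_max)
    mean = scale * mean_one
    second = scale * second_moment_one
    variance = second - mean * mean
    return mean, second, variance
```

Once n exceeds N the scale factor saturates, so the predicted moments stop growing with additional clutter.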

Finally we derive $f_{V^+_M}$. We first look at a single correctly hypothesized weight disk,
that is, the weight that is scored by a disk which is projected into the image as a
result of testing a correct hypothesis. The weight disk always contains the correctly
projected model point, unless (a) the point is occluded, or (b) the point falls outside
the $2\sigma_e$ Gaussian weight disk. If either of these two things happens, then all of the
clutter points get a chance to score inside the weight disk. We will also assume
for simplicity that if the true point appears inside the disk then we will take its
contribution even if clutter points also appear inside the disk. Then the probability
that we will see weight v > 0 is:

P{disk gets weight v > 0} =

    P{disk gets weight v > 0 | true point seen} +

    P{true point not seen} P{disk gets weight v > 0 | false point seen}

And for the case when v = 0:

P{disk gets weight v = 0} = P{true point not seen} P{false point not seen}

For convenience let us call B the probability that no clutter point falls inside a true
weight disk. In the case in which occlusion = 0 the expression for the density $f_{V^+_M}$
would be correctly given by the expression

$$f_{V^+_M}(v) = \begin{cases} [c + (1 - c)e^{-2}]\, B\, \delta(v) & v = 0 \\ f_{V_M}(v) + [c + (1 - c)e^{-2}]\, f_{V^+_{\bar M}}(v) & v \neq 0 \end{cases}$$

in which the c's would disappear. When occlusion $\neq$ 0 we have the problem that we
don't know how many of the observed image points are clutter points. Therefore, we
must define a random variable M describing the number of model points that actually
show up in the image. M is binomially distributed with mean (1 - c)m and variance
c(1 - c)m. Using this random variable instead of m in the expression for $f_{V^+_{\bar M}}$ in the
above expression, the density $f_{V^+_M}$ becomes:

$$f_{V^+_M}(v) = \begin{cases} [c + (1 - c)e^{-2}]\, B\, \delta(v) & v = 0 \\ f_{V_M}(v) + [c + (1 - c)e^{-2}]\, \min(n - M, N)\, f_{V_{\bar M}}(v \mid m = 1) & v \neq 0 \end{cases}$$


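Let us temporarily rename $p = c + (1 - c)e^{-2}$ and assume that min(n - M, N) = n - M
for ease of manipulation. Then

$$f_{V^+_M}(v) = \begin{cases} p\, B\, \delta(v) & v = 0 \\ f_{V_M}(v) + p\,(n - M)\, f_{V_{\bar M}}(v \mid m = 1) & v \neq 0 \end{cases}$$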

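We use Equations A.2 and A.3 to remove the random variable M, first for the mean:

$$E[V^+_M] = E[V_M] + p\,(n - (1 - c)m)\, E[V_{\bar M} \mid m = 1] \qquad (5.1)$$

and proceeding in stages for the variance:

$$E[V^+_M \mid M] = E[V_M] + p\,(n - M)\, E[V_{\bar M} \mid m = 1]$$

$$\mathrm{Var}(V^+_M \mid M) = E[(V^+_M)^2 \mid M] - E[V^+_M \mid M]^2$$

$$= \left[ E[V_M^2] + p\,(n - M)\, E[V_{\bar M}^2 \mid m = 1] \right] - \left[ E[V_M] + p\,(n - M)\, E[V_{\bar M} \mid m = 1] \right]^2$$

$$\mathrm{Var}(E[V^+_M \mid M]) = p^2\, E[V_{\bar M} \mid m = 1]^2\, \mathrm{Var}(M)$$

$$E[\mathrm{Var}(V^+_M \mid M)] = E[V_M^2] - E[V_M]^2 + p\, E[n - M]\, E[V_{\bar M}^2 \mid m = 1]$$

$$\qquad - p^2\, E[(n - M)^2]\, E[V_{\bar M} \mid m = 1]^2$$

$$\qquad - 2p\, E[n - M]\, E[V_M]\, E[V_{\bar M} \mid m = 1]$$

Next, substituting the expression $E[M^2] = \mathrm{Var}(M) + E[M]^2$ into the last equation
and solving for the entire expression, we get:

$$\mathrm{Var}(V^+_M) = \mathrm{Var}(E[V^+_M \mid M]) + E[\mathrm{Var}(V^+_M \mid M)]$$

$$= p^2\, E[V_{\bar M} \mid m = 1]^2\, \mathrm{Var}(M) + E[V_M^2] - E[V_M]^2$$

$$\qquad + p\,(n - E[M])\, E[V_{\bar M}^2 \mid m = 1]$$

$$\qquad - p^2 \left[ (n - E[M])^2 + \mathrm{Var}(M) \right] E[V_{\bar M} \mid m = 1]^2$$

$$\qquad - 2p\, E[n - M]\, E[V_M]\, E[V_{\bar M} \mid m = 1]$$

Note that the first term in the sum cancels out the variance in the third line, leaving
the expression:

$$\mathrm{Var}(V^+_M) = E[V_M^2] - E[V_M]^2$$

$$\qquad + p\,(n - E[M])\, E[V_{\bar M}^2 \mid m = 1] - p^2\,(n - E[M])^2\, E[V_{\bar M} \mid m = 1]^2$$

$$\qquad - 2p\,(n - E[M])\, E[V_M]\, E[V_{\bar M} \mid m = 1]$$

Putting back the min expression and substituting the value of E[M] we finally get

$$\mathrm{Var}(V^+_M) = E[V_M^2] - E[V_M]^2 + p\, \min(n - (1 - c)m, N)\, E[V_{\bar M}^2 \mid m = 1]$$

$$\qquad - p^2\, \min(n - (1 - c)m, N)^2\, E[V_{\bar M} \mid m = 1]^2$$

$$\qquad - 2p\, \min(n - (1 - c)m, N)\, E[V_M]\, E[V_{\bar M} \mid m = 1]$$

in which all the terms are known.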
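
The cancellation in this derivation can be verified numerically. The sketch below evaluates the closed-form mean and variance and compares them with a direct application of the decomposition Var(V) = Var(E[V | M]) + E[Var(V | M)], summing over the binomial distribution of M. It assumes min(n - M, N) = n - M throughout; the function names and the sample moment values are hypothetical.

```python
from math import comb, exp


def closed_form_moments(n, m, c, p, a1, a2, b1, b2):
    """Mean and variance from the final expressions.

    a1, a2 -- E[V_M], E[V_M^2]; b1, b2 -- single-point clutter moments (m = 1).
    """
    k = n - (1.0 - c) * m  # n - E[M]
    mean = a1 + p * k * b1
    var = a2 - a1**2 + p * k * b2 - p**2 * k**2 * b1**2 - 2.0 * p * k * a1 * b1
    return mean, var


def conditioned_moments(n, m, c, p, a1, a2, b1, b2):
    """Same quantities via explicit conditioning on M ~ Binomial(m, 1 - c)."""
    q = 1.0 - c
    pmf = [comb(m, j) * q**j * c ** (m - j) for j in range(m + 1)]
    cond_mean = [a1 + p * (n - j) * b1 for j in range(m + 1)]
    cond_var = [a2 + p * (n - j) * b2 - cm**2 for j, cm in enumerate(cond_mean)]
    mean = sum(w * cm for w, cm in zip(pmf, cond_mean))
    var_of_mean = sum(w * (cm - mean) ** 2 for w, cm in zip(pmf, cond_mean))
    mean_of_var = sum(w * cv for w, cv in zip(pmf, cond_var))
    return mean, var_of_mean + mean_of_var


if __name__ == "__main__":
    c = 0.25
    p = c + (1.0 - c) * exp(-2.0)
    print(closed_form_moments(100, 10, c, p, 0.3, 0.2, 0.001, 0.0005))
    print(conditioned_moments(100, 10, c, p, 0.3, 0.2, 0.001, 0.0005))
```

The two routes agree to floating-point precision, confirming that the Var(M) terms cancel as stated.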

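The accumulated densities $W^+_H$ and $W^+_{\bar H}$ are simply collected over all m error disks
independently, so that

$$W^+_H \sim N\left(m\,E[V^+_M],\; m\,\mathrm{Var}(V^+_M)\right)$$

$$W^+_{\bar H} \sim N\left(m\,E[V^+_{\bar M}],\; m\,\mathrm{Var}(V^+_{\bar M})\right)$$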
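
Given the two normal approximations, a predicted ROC curve follows by sweeping a threshold and reading off the upper tail of each distribution. The sketch below is a minimal illustration; the function names and the numerical moments in the example are hypothetical, and a hypothesis is assumed to be accepted when its accumulated score exceeds the threshold.

```python
import math


def normal_sf(x, mean, var):
    """P{X > x} for X ~ N(mean, var)."""
    return 0.5 * math.erfc((x - mean) / math.sqrt(2.0 * var))


def roc_points(m, true_mean, true_var, false_mean, false_var, thresholds):
    """Return (P_F, P_D) pairs for scores summed over m independent disks."""
    points = []
    for t in thresholds:
        p_f = normal_sf(t, m * false_mean, m * false_var)  # false alarm
        p_d = normal_sf(t, m * true_mean, m * true_var)    # detection
        points.append((p_f, p_d))
    return points


if __name__ == "__main__":
    # Hypothetical per-disk moments, for illustration only.
    for p_f, p_d in roc_points(10, 0.5, 0.04, 0.1, 0.02, [1.0, 2.0, 3.0, 4.0, 5.0]):
        print(p_f, p_d)
```

Raising the threshold lowers both probabilities, which traces the curve from the upper right toward the lower left.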

In Figure 5-2 we show an ROC comparison of Schemes 1 and 3. The new method,
Scheme 3, performs very poorly in theory because as the number of clutter points
goes up, the chance of at least one point appearing in every disk is very high. When
this happens it is no longer possible to distinguish between correct and incorrect
hypotheses. For a $\sigma_0 : \sqrt{A}$ ratio of 1:200, the maximum number of image points
allowable by the method is 818; at this point a random point will appear in every
weight disk with probability 1 and the ROC curve becomes almost diagonal.

In Figure 5-3 we see the actual $P_F$, $P_D$ and ROC curves using this weighting method.
The predicted performance greatly underestimates the actual performance of the
method. This discrepancy is due to a simplification which we glossed over in our
analysis, which is the assumption that the weight disks are disjoint. When this is not
the case, then this assumption overestimates the actual probability of a clutter point
landing in a weight disk, thereby overestimating the mean of the random variable
$V^+_{\bar M}$. The same effect occurs in Scheme 1 as well, though to a lesser extent.

Despite  the  fact  that  the  actual  performance  of  Scheme  3  is  better  than  predicted,  it 
is  still  the  case  that  the  actual  ROC  curves  for  Scheme  3  are  not  as  good  as  those  for 
Scheme  1,  as  can  be  seen  in  Figure  5-4. 

It is clear that Scheme 1 is better than Scheme 3, most importantly because the former
performs better in actual simulations than Scheme 3. It also has the advantage that
our predictions for the distributions of $W_H$ and $W_{\bar H}$ are more accurate than those
of $W^+_H$ and $W^+_{\bar H}$. For both these reasons we will perform all experiments in the next
chapter using Scheme 1.

Summary  of  Weighting  Schemes 

In  this  chapter  we  discussed  two  different  possible  weighting  schemes  and  used  the 
ROC  curves  to  compare  them  to  the  original  scheme  we  developed  in  Chapter  3. 
Ultimately  we  showed  that  our  original  scheme  has  better  error  performance  than  both 
alternative  schemes.  It  is  important  to  note  that  none  of  the  schemes  we  have  analyzed 
is  optimal  with  respect  to  a  maximum  likelihood  criterion,  which  would  assign  a 
better score to hypothesis $H_a$ than to $H_b$ if P{image | $H_a$} > P{image | $H_b$}. For
the remainder of the thesis we will use the original scheme we developed in Chapter
3.

Figure 5-2: The graphs show the comparison of ROC curves for Scheme 1 (top curve)
versus Scheme 3 (bottom curve). The x and y axes are $P_F$ and $P_D$, respectively. Increasing
($P_F$, $P_D$) corresponds to a decreasing threshold $\theta$. The (m, n) pairs are
(10,100), (10,500), (30,100), and (30,500). For all graphs, occlusion = 0 and $\sigma_0$ = 2.5.

Figure 5-3: Comparison of predicted to empirical curves for probability of false alarm,
probability of detection, and ROC curves for Scheme 3. The empirical points are indicated
by boxes. The axes for the graphs, from left to right, are (x, y) = ($\theta$, $P_F$), ($\theta$, $P_D$), and ($P_F$, $P_D$).
For all graphs, m = 10, occlusion = 0, $\sigma_0$ = 2.5. From top to bottom, n = 10, 100, 500.

Figure 5-4: Comparison of empirical ROC curves for Schemes 1 and 3. In all graphs the
ROC curve for Scheme 1 is above that of Scheme 3. The axes are (x, y) = ($P_F$, $P_D$). For
all graphs, m = 10, occlusion = 0, $\sigma_0$ = 2.5. From left to right, n = 10, 100, 500.

Chapter 6

A  Feasibility  Demonstration 


In  the  preceding  chapters  we  presented  a  theoretical  approach  to  placing  a  bound  on 
the  probability  of  true  versus  false  detection  for  the  output  of  a  recognition  problem 
in  a  limited  domain,  and  some  simulations  supporting  the  viability  of  the  method.  In 
this  chapter  we  argue  that  the  Gaussian  error  model  is  a  reasonable  approximation  for 
point  feature  locations  by  measuring  the  noise  associated  with  different  point  feature 
types.  Finally,  we  demonstrate  the  process  of  applying  the  analysis  to  real  images. 


6.1  Measuring  Noise 

6.1.1  Feature  Types 

A point feature is a physical aspect of the model which can be detected at a 2D
location in an image of the model, regardless of the model's pose. When considered
in this light, we can see that there are two aspects of how powerful a feature
type is as a representation: its ability to represent the model, and its ability to be
reliably extracted from the image. Asada and Brady [AB86] discuss a model representation
called a Curvature Primal Sketch, in which point features are defined as
distinctive points in the curvature of the boundary of the object; i.e., zero crossings,
minima/maxima, and discontinuities, in the boundary curve's first derivative. In the
domain of planar models, these model features are all invariant to affine transformations
and by extension, pose (note that it is the location of the features, and not the
magnitude of the boundary curve's first derivative at these points, that is affine invariant).
However, reliably extracting these sorts of features from an image is difficult,
since their location is extremely dependent on factors such as pixel resolution, image
processing parameters, and even model pose, since at certain poses the magnitude
of the boundary curve's first derivative becomes so small that the features become
undetectable.

Another possible representation is to limit the feature types to a single kind of curvature
discontinuity, that is, intersections of straight line segments greater than some
fixed length. Line crossings, junctions, and corners are all examples of this. This
representation has its own problems since boundary curves on the model cannot be
represented at all by line segments. From the image processing side, a curve appearing
in an image gives rise to an indeterminate number of corner features, depending on the
magnitude of the curvature and image processing parameters. However, the locations
of intersections of long straight line segments might be more stably detected.

Wells [Wel92] uses as point features the center of mass of connected pixel strings
of length k, broken randomly. This is similar, but not equivalent, to sampling the
contour of the object at fixed intervals. One possible problem with this type of point
feature is simply that there are so many; that is, if k is too small, we are not
significantly pruning the search in transformation space by using these features to
form hypotheses.

Few recognition systems in the literature use the simple point features described
above; for example, SCERPO [Low86] and HYPER [AF86] both use entire line segments
as features, the Local-Feature-Focus method of Bolles and Cain [BC82] uses
oriented corners and holes, Huttenlocher's ORA system [Hut88] uses oriented points,
and Ettinger's SAPPHIRE system ([Ett87]) uses the compound point features defined
in the Curvature Primal Sketch. The advantage of using more information per
feature is that the additional information will often eliminate hypotheses consisting
of impossible image-feature pairings.

One disadvantage of using more complex features is that there are more likely to be
errors in their extraction, and these errors may prevent correct image-model feature
pairings from being tested. Also, Jacobs has recently shown that some basic work
using simple point features in the domains of linear combinations of models [UB89]
and indexing of 3D models [CJ91] does not extend to oriented point features [Jac92].

We wish to sidestep the issue of which is the best feature representation by arguing
that no matter what feature type is chosen, there will inevitably be some error in
extracting the features from the image, no matter what dimensionality the feature
type has, be it a simple 2D location, or a 2D location with orientation, magnitude,
or any combination of other kinds of information. Our goal is to argue that, given a
feature type, we can measure the noise associated with it and apply an error analysis
to determine the probability of false versus true identification. Because it was
simpler to use simple point features, we have limited ourselves to using only these.

We have measured the noise associated with four feature types, under different conditions.
They are:

•  intersections  of  straight  line  segments  of  a  fixed  minimum  length. 

•  points  of  maximum  curvature. 

•  inflection  points. 

•  centers  of  mass  of  connected  fixed  length  pixel  strings. 


The  actual  algorithm  we  used  to  extract  these  features  is  unimportant,  since  we  are 
interested  in  measuring  the  variability  of  each  type  of  feature,  given  a  fixed  feature 
finder.  We  have  attempted  to  measure  the  noise  per  feature  type  as  a  function  of 

•  different  images  of  the  same  scene, 

•  different  degrees  of  image  smoothing, 

•  illumination. 

Interestingly  enough,  there  was  some  variation  in  feature  locations  even  for  the  first 
image  group,  where  one  would  expect  the  images  to  be  identical.  In  fact,  there  are 
slight  differences  in  pixel  values  between  images,  probably  due  to  different  amounts 
of  light  reaching  the  camera  during  the  imaging  stage  (fluorescent  lighting  was  used), 
or  possibly  due  to  quantization  error.  This  introduces  a  level  of  uncertainty  in  all 
the  subsequent  image  processing  stages,  from  image  smoothing  to  edge  detection  to 
boundary  tracing  to  subsequent  feature  extraction. 


6.1.2  Procedure  for  Measuring  Noise 

To measure the noise associated with each feature type under each kind of condition,
three groups of images were processed as follows:

•  5 images of a telephone, same illumination, at 5 second intervals. Each image
was smoothed with a Gaussian mask with σ = 2 pixels and Canny edge detected
with thresholds of 2 and 4.

•  A single image of a fork, with 5 different sized Gaussian smoothing masks:
σ = 1, 1.5, 2, 2.5, and 3 pixels. Each image was Canny edge detected with
thresholds of 2 and 4.

•  5 images of an army knife, varying illuminant position and strength. Each
image was smoothed with a Gaussian mask with σ = 2 pixels and Canny edge
detected with thresholds of 2 and 4.

The result of processing yielded 5 different edge maps for each group. The original
images were all taken with a Panasonic TV camera with automatic gain control,
using a 16mm lens and manually focused. The exact conditions were not measured
precisely,  since  we  are  interested  in  them  only  insofar  as  they  conform  to  “reasonable” 
operating  conditions.  All  images  consisted  of  a  720  x  484  pixel  map.  Only  overhead 
lighting  was  used  except  for  the  last  image  group,  for  which  a  floodlight  was  used  to 
change  the  direction  of  illumination. 

For each edge map, all chains of connected pixels were computed and smoothed, and
each feature type found:

•  Intersection points: the chains were segmented into straight edge segments
using a recursive line-splitting algorithm ([C5LP87]), then all intersection points
whose distance was < 10 pixels away from the ends of the segments which
formed them were kept.

•  Maximum curvature points: the derivative of the tangent along the curve was
computed, then all minima and maxima exceeding a fixed threshold were kept
as point features.

•  Inflection points: the zero crossings of the tangent's derivative along the curve
whose absolute slope exceeded a fixed threshold were kept as point features.

•  Mass centers of chain fragments: the chains were broken into fragments of
length 10, then the center of mass of each one taken as a point feature.
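
The last feature type is simple enough to sketch directly. The code below is an illustration under stated assumptions, not the extractor used for the measurements: the function name is hypothetical, fragments are cut at fixed offsets starting from the first pixel of the chain, and any leftover pixels shorter than a full fragment are dropped.

```python
def fragment_mass_centers(chain, length=10):
    """Break a pixel chain into consecutive fragments and return their centers of mass.

    chain  -- list of (x, y) pixel coordinates in order along the contour
    length -- number of pixels per fragment
    """
    centers = []
    for start in range(0, len(chain) - length + 1, length):
        fragment = chain[start:start + length]
        cx = sum(p[0] for p in fragment) / length
        cy = sum(p[1] for p in fragment) / length
        centers.append((cx, cy))
    return centers
```

Because the cut positions depend on where the chain starts, the same contour can yield shifted centers in different images, which is one source of the along-contour variance discussed later in this chapter.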

For each image group, the feature locations for each of the 5 images were indicated
on a single bitmap, color coded by the index of the image from which it came. In
Figures 6-1, 6-3 and 6-5, the features are shown only in white, due to reproduction
limitations. The correspondences across images were manually indicated for the first
three feature types by mousing on clusters which the user believed indicated a feature
of a given type. For the fourth feature type, the correspondences were automatically
formed by clustering together those features from every image which mutually agreed
on their nearest corresponding feature. Figures 6-1, 6-3 and 6-5 show a representative
image from each image group, and the point features from all the image groups with
their correspondences indicated by circles. The feature types depicted are intersection
points (phone), centers of mass (fork), and maximum curvature points (army knife).
The inflection point feature was the most unstable, and is not illustrated.
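
The automatic correspondence step for the fourth feature type can be illustrated for a pair of images. The sketch below keeps a pair of features only when each is the other's nearest neighbor; the function names are hypothetical, and extending the rule from two images to all five (as the text describes) is left out for brevity.

```python
import math


def nearest_index(point, candidates):
    """Index of the candidate closest to point (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))


def mutual_nearest_pairs(features_a, features_b):
    """Pairs (i, j) where a[i] and b[j] each pick the other as nearest neighbor."""
    pairs = []
    if not features_a or not features_b:
        return pairs
    for i, p in enumerate(features_a):
        j = nearest_index(p, features_b)
        if nearest_index(features_b[j], features_a) == i:
            pairs.append((i, j))
    return pairs
```

A feature that has no mutual partner is simply left out of every cluster, so spurious detections in one image do not contaminate the noise estimate.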

For  each  cluster  within  a  given  feature  type,  the  mean  in  both  the  x  and  y  direction 
was  calculated,  then  for  each  feature  in  the  cluster,  its  distance  from  the  mean  of 
the  cluster  was  histogrammed.  This  yielded,  for  each  image,  a  histogram  per  feature 
type.  This  histogram  is  intended  to  be  an  accurate  sample  of  the  error  distribution  of 
the  feature  type.  Some  sample  histograms  are  shown  next  to  the  pictures  from  which 
the  features  were  clustered. 
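
The per-cluster measurement can be sketched as follows. This is an illustrative reading of the procedure, not the original code: deviations are kept signed and per coordinate (as in the x and y histograms of the figures), the variance uses a divisor of n, and all function names are hypothetical.

```python
import math


def cluster_deviations(cluster):
    """Return the cluster mean and the signed x and y deviations of each feature."""
    n = len(cluster)
    mean_x = sum(x for x, _ in cluster) / n
    mean_y = sum(y for _, y in cluster) / n
    dx = [x - mean_x for x, _ in cluster]
    dy = [y - mean_y for _, y in cluster]
    return (mean_x, mean_y), dx, dy


def histogram(values, bin_width):
    """Count values per bin; bin b covers [b * bin_width, (b + 1) * bin_width)."""
    counts = {}
    for v in values:
        b = math.floor(v / bin_width)
        counts[b] = counts.get(b, 0) + 1
    return counts


def pooled_variance(deviations):
    """Variance of deviations already measured from their cluster means."""
    return sum(d * d for d in deviations) / len(deviations)
```

Pooling the deviations from all clusters of one feature type into a single histogram gives the sample of the error distribution that is compared against a Gaussian.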

In addition, the error distribution of a third coordinate for each feature type was
calculated. For the intersection points, the third dimension is the angle; for maximum
curvature points, it is the magnitude of the curvature; for inflection points, the
slope; and for centers of mass, the tangent of the curve at that point. The results
indicate that the error distribution along this third dimension can also be modelled as
Gaussian, and suggest that our method could be extended to incorporate this extra
information, though we have not done so.

The calculated variances of each feature type per image group are:

Figure 6-1: First image group. The group consists of 5 images of a telephone, same
lighting conditions and smoothing mask. The top figure shows one of the images of the
group, with the intersection of straight line segment features from all 5 images of the group
superimposed in white. The bottom figure shows the result after Canny edge detection and
chaining. The straight line segments from which the intersection points were taken are not
illustrated. Superimposed on the bottom figure are the locations of the clusters chosen for
the noise measurements, indicated by circles. All intersection features located within the
bounds of the circle were used as sample points for the noise measurement.

Figure 6-2: Left hand side, from top to bottom: the histograms of the x, y and z coordinates
of the intersection features depicted in the previous figure. For intersection features,
the z coordinate is the angle of intersection. The Gaussian distribution with mean and
variance defined by the histogram is shown superimposed on the graph. On the right hand
side is the cumulative histogram, again with the cumulative distribution superimposed.

Figure 6-3: Second image group, consisting of the edges from 5 different smoothing masks
of the same image. Again, the top figure shows one of the images of the group, but the
features shown are the center of mass features from the boundary, randomly broken into
segments of length 10. The bottom figure shows the clusters, i.e., groups of features which
mutually agree upon their nearest neighbor features across all of the images.

Figure 6-4: Histograms of the x, y and z coordinates of the center of mass features depicted
in the previous figure. For this feature type, the z coordinate is the angle of the tangent to
the curve at the center of mass.

Figure 6-5: ... knife under 5 different illuminations. The top ...
with the maximum curvature features from all ...

Figure 6-6: Histograms of the x, y and z coordinates of the maximum curvature features
depicted in the previous figure. The z coordinate is the magnitude of the curvature.

Since there were so few reliable clusters of inflection points for each set of images, the
results  from  the  distribution  can’t  be  taken  as  representative  of  any  sort  of  underlying 
distribution  for  this  feature  type.  The  intersection  and  maximum  curvature  points 
were  fairly  abundant  and  stable.  For  the  centers  of  mass  feature,  note  that  the 
variances  are  about  twice  as  large  in  the  x  direction  as  the  y  direction.  This  is 
because  the  contour  of  the  objects  were  aligned  more  along  the  x  direction,  so  the 
uncertainty  would  naturally  be  greater  along  the  contour’s  tangent  due  to  the  manner 
in  which  these  points  were  found.  We  expect  that  any  directional  bias  of  this  sort 
will  be  symmetrised  by  the  random  orientation  of  the  object  in  the  image. 

Because the results between image groups are so disparate, we chose to use the average
variance for a particular image group as a guide for choosing $\sigma_0$ per experiment (again,
assuming that the rotational component of the pose distribution allows us to do this).
The calculation is simply

$\sigma_0 =$

per feature type per experiment. In our subsequent work we will limit ourselves to
using only maximum curvature feature types. For these features, the above calculation
yields $\sigma_0 \approx$ 0.65, 1.7, 1.8 for the phone, fork, and knife respectively. In the actual
experiments it was found that better performance was achieved by using values that
were slightly larger than these for the phone and the knife.

6.1.3  Discussion of the Method


There are several aspects of the noise calculation that an observer might take issue
with. The first is simply the assumption that the real location of the sought after
feature is the mean of every cluster; that is, there is no bias in the error distribution.
This brings up the question, what is the real location of a feature? Suppose the
particular feature finder we were using always displaced features to the left by 3
pixels, or displaced features in an orientation-dependent fashion. Representing the
distribution of vectors between where we believe the feature should be, and where the
feature finder actually localizes them, can be a problem. However, the complexity
disappears if we decide to let the feature finder be the judge of the "actual" location
of the features. This will require the model representation to be determined by the
feature finder (we will discuss exactly how in the next section). Also, any orientation-dependent
directional bias of the feature finder should be randomized by the fact that
the orientation of the model in the image is random as well.

Another objection might be that we are not measuring the correct thing: rather,
what should be measured is the displacements of a single feature from, say, 100
different images. Instead, what we are really doing is sampling many different random
variables, and as such, it is no wonder that we are ending up with a close to Gaussian
error distribution, since by the Central Limit Theorem, the sum of many independent
random variables will be approximately Gaussian, largely regardless of their individual
distributions. The answer to this charge is that this sum of random variables is exactly
the distribution that we are interested in measuring; far from invalidating the method,
this objection reinforces it.

Another question might be about the manner in which the clusters were formed; that
is, at least for 3 out of 4 of the feature types, we manually clustered together
those features that seemed to be close to a location at which it seemed reasonable
that a feature should appear. There was no guarantee that there was exactly one
feature from each image group in the cluster; some clusters probably were missing
representative features from some images, and some clusters probably contained several
features from the same image. Also, we didn’t take all possible clusters, only those
which seemed subjectively appropriate. Despite these issues, we claim that since
the model representation is chosen by the user (i.e., which model features comprise
the representation), it is not unreasonable for the user to determine the range of
locations at which a projected model feature may appear. As to the question of a
variable number of features per image included in a single distribution, we note that
if the image feature extraction process drops a feature for one of the images in the
distribution, there is nothing we can do. If one image contains several features close
to the desired feature location, then including all of them in the distribution implies
that any of them is a feasible match for a model feature which projects to near that
location.


6.1.4 Using Different Feature Types


Now that we have several feature types, each with its associated $\sigma_0$, we look at the
problem of combining information for a hypothesis consisting of pairings of different
feature types: i.e., suppose we have a hypothesis consisting of a size 3 pairing of
feature types 1, 2 and 3 with associated error standard deviations $\sigma_1$, $\sigma_2$ and $\sigma_3$
respectively. Then the distribution of possible locations of a fourth point $(\alpha, \beta)$ of
feature type 4 with error standard deviation $\sigma_4$ remains centered at the expected
location, but with a variance that now depends on all four $\sigma_i$.

We can still weight the occurrences of a corroborating point as before. However,
the calculation of the densities for $W_H$ and $W_{\bar{H}}$ becomes more involved, since we can
no longer use the same approximations as before. If the $\sigma$'s are not
all equal, these density functions are dependent on the four new random variables
$\sigma_i$, $i = 1 \ldots 4$, whose distributions are different for each model. Though it is still
possible to do the calculation, it is much more complex than before.
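
The variance of the fourth point for mixed feature types can be sketched by propagating the individual errors through the affine prediction of that point. This is a reconstruction under stated assumptions rather than the original expression: we assume that $(\alpha, \beta)$ are the affine coordinates of the fourth model point in the basis formed by the first three points, that the four localization errors are independent, zero-mean, circular Gaussians, and that $\sigma_1$ is attached to the basis origin.

```latex
% Assumed prediction of the fourth point from the three basis points:
%   p_4 = (1 - \alpha - \beta)\, p_1 + \alpha\, p_2 + \beta\, p_3
% Independent errors add in variance, weighted by the squared coefficients:
\sigma_e^2 = (1 - \alpha - \beta)^2 \sigma_1^2
           + \alpha^2 \sigma_2^2
           + \beta^2 \sigma_3^2
           + \sigma_4^2
% Special case with all \sigma_i = \sigma_0:
\sigma_e^2 = \sigma_0^2 \left( 1 + \alpha^2 + \beta^2 + (1 - \alpha - \beta)^2 \right)
```

Under these assumptions the variance differs from one model point to the next both through $(\alpha, \beta)$ and through the feature types involved, which is why a single approximation no longer covers every hypothesis.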


6.2  Building  the  Planar  Model 

Building models for 2D planar objects is particularly easy, since a single image contains
sufficient information to do it. In order to be able to use the error model which
we have analysed and measured, we build our model as follows: a single image of
the model in an arbitrary pose is run through the feature detector. The user then
clicks on clusters of points appearing near the location of a desired feature; the mean
of this cluster is then incorporated into the model representation. This method of
building the model is compatible with the way in which we measure and represent
error in our analysis.
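
The model-building step can be sketched in a few lines. This is a minimal illustration, not the original implementation: the function names `cluster_mean` and `build_planar_model` and the sample coordinates are hypothetical, and each cluster is assumed to be a non-empty list of (x, y) feature locations selected by the user.

```python
from statistics import fmean


def cluster_mean(points):
    """Return the (x, y) mean of a non-empty cluster of 2D feature points."""
    xs, ys = zip(*points)
    return (fmean(xs), fmean(ys))


def build_planar_model(clusters):
    """Turn each user-selected cluster into one model feature (its mean)."""
    return [cluster_mean(cluster) for cluster in clusters]


# Two hypothetical clusters of detected feature points.
clusters = [
    [(10.0, 20.0), (11.0, 21.0), (12.0, 19.0)],
    [(50.0, 40.0), (52.0, 44.0)],
]
model = build_planar_model(clusters)
```

Using the cluster mean as the model feature mirrors the way the noise was measured, where the mean of each cluster is taken to be the true feature location.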


6.3  Applying  the  Error  Analysis  to  Automatic 
Threshold  Determination 

Here is an example of an application of our error analysis to a typical problem in
object recognition: automatic threshold determination for a system which uses our
recognition algorithm. Ideally, we would like to build a demonstration system in
which, given a model and some image, the user specifies a certainty level up front,
i.e., “I don’t want the system to tell me about anything unless it is 90% certain that
it is an instance of the model”. In order to achieve this, it would have to be the case
that

$$\frac{N_H P_D}{N_H P_D + N_{\bar{H}} P_F} > 0.9$$

in which $N_H$ and $N_{\bar{H}}$ are the total number of true and false hypotheses, respectively.
However, in our problem these numbers are unknown; the only control we have is
over the values of $P_F$ and $P_D$, and this does not bound the certainty of the result.
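
To see why fixing the detection and false alarm probabilities does not bound the certainty, consider the fraction of reported hypotheses that are true, i.e., true detections divided by all detections. The sketch below evaluates that fraction for one operating point and two hypothetical populations of hypotheses; the function name `certainty` and all the counts are illustrative assumptions.

```python
def certainty(n_true, n_false, p_d, p_f):
    """Fraction of reported hypotheses expected to be true instances."""
    reported_true = n_true * p_d
    reported_false = n_false * p_f
    return reported_true / (reported_true + reported_false)


# Same operating point (p_d, p_f), two hypothetical numbers of false hypotheses.
few_false = certainty(n_true=4, n_false=92, p_d=0.9, p_f=0.01)
many_false = certainty(n_true=4, n_false=1_000_000, p_d=0.9, p_f=0.01)
```

With only 92 false hypotheses the certainty is roughly 0.8, while with a million it falls far below 0.001, even though the detection and false alarm probabilities are unchanged.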

Thus our demo will be as follows: the user is able to specify a desired value for either
$P_F$ or $P_D$. The image is processed to find the number of model and image features,
and then the system constructs the associated ROC curve and finds the implied $P_D$
(if a $P_F$ was specified, otherwise the implied $P_F$ for the specified $P_D$), and what
threshold will give that performance. It then notifies the user of the implications of
the choice.
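
The step from a specified false alarm probability to a threshold and an implied detection probability can be sketched as follows, assuming (as the analysis does) that the scores of false and true hypotheses are each approximately Gaussian. The means and standard deviations below are made-up placeholders; in the actual system they would follow from the number of model and image features.

```python
from statistics import NormalDist


def threshold_for_pf(false_dist, p_f):
    """Threshold that a false hypothesis exceeds with probability p_f."""
    return false_dist.inv_cdf(1.0 - p_f)


def implied_pd(true_dist, threshold):
    """Probability that a true hypothesis scores above the threshold."""
    return 1.0 - true_dist.cdf(threshold)


# Hypothetical score distributions for false and true hypotheses.
w_false = NormalDist(mu=5.0, sigma=2.0)
w_true = NormalDist(mu=15.0, sigma=3.0)

threshold = threshold_for_pf(w_false, p_f=0.01)
p_d = implied_pd(w_true, threshold)
```

Specifying the detection probability instead works the same way in reverse: invert the distribution of true scores to get the threshold, then read off the implied false alarm probability from the distribution of false scores.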

As it turns out, the simplifying assumption that the clutter is uniformly distributed in
real images is not only incorrect, but the deviation from the modeled error also very
strongly affects the way the system works. The reasons for this are illustrated by an
extreme example: imagine an image in which n feature points appear in the left half
of the image while the right half has none. When we project an error disk into the
image, if it falls in the left half it is twice as likely to encompass an image point
at random as our prediction indicates. In addition, the denser region is more likely to be
sampled (in this example, will definitely be sampled) for the 3 image points
chosen at random to form the pose hypothesis, making it much more likely that the
remaining error disks also project to the left side of the image. These two effects
result in a much higher effective density than indicated by the mere number of points
appearing in the image.
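
The factor of two in this example follows from a one-line density calculation. As a sketch, assume an image of area $A$, an error disk of radius $r$ lying entirely within the left half, and clutter that is uniform within that half; the symbols $A$, $r$ and $\lambda$ are introduced here only for illustration.

```latex
\lambda_{\text{assumed}} = \frac{n}{A}, \qquad
\lambda_{\text{left}} = \frac{n}{A/2} = 2\,\lambda_{\text{assumed}}, \qquad
E[\text{points in disk}] = \pi r^2 \lambda_{\text{left}}
                         = 2\,\pi r^2 \lambda_{\text{assumed}}
```

For disks small enough that they rarely contain more than one point, the probability of encompassing a point is close to this expected count, and so it roughly doubles as well.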

We can attempt to fix the problem in two ways. We can estimate an effective
image density as a function of density variability across the image and use a single
ROC curve and threshold per image, as we have been doing up until now. Or, we can
calculate the effective density per hypothesis (that is, the density of the region in which
the projected model falls), and use a different ROC and threshold per hypothesis.
We chose the latter approach; that is, we chose to calculate an ROC curve not for an
entire picture (since a single value for n does not suffice to describe all hypotheses
when the density is so variable across the image) but rather on a per-hypothesis basis.
So, instead of being able to predict a single threshold for all hypotheses emanating
from an image, we calculate the threshold every time we test a different hypothesis.
Since we are changing the ROC curve per hypothesis, we can choose the threshold to
constrain either the false alarm rate or the true detection rate, but not both.


6.3.1  The  Problem  with  the  Uniform  Clutter  Assumption 

In this section we illustrate in more detail the problem with the uniform clutter
assumption. First we show the original demonstration in which we found a discrepancy
between our predicted and actual behavior of the system, and subsequently, a
sequence of experiments to isolate its cause.

Initially we built a demo which runs in one of two modes, random and exhaustive.
For both modes, the system takes as its input

•  The  model,  consisting  of  a  list  of  2D  feature  locations, 

•  The  image,  also  consisting  of  a  list  of  2D  feature  locations, 

•  Estimated  occlusion  level,  which  is  a  number  between  0  and  1,  inclusive, 

•  The value of $\sigma_0$ for this feature type.

In addition there are two optional arguments, SUB-MODEL and SUB-IMAGE, which
are subsets of the model and image, respectively. If these optional arguments are
non-empty, the demo runs in exhaustive mode; otherwise it runs in random mode. The
motivation for these optional arguments is to limit the number of hypotheses tested
to a reasonable size, and to be able to include some correct hypotheses among those
tested. For instance, in the telephone test with 33 model features and 250 image
features, the number of hypotheses, though polynomial, is still $\approx 4 \times 10^{11}$. Even if
we could check one hypothesis per second, this would still take 13 thousand years,
risking a very dull demo for the user. However, when a sub-model and sub-image
group of size 4 that correctly correspond to each other is specified, then the demo
exhaustively tests 96 hypotheses, of which 4 are correct.
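
The count of 96 hypotheses with 4 correct ones can be reproduced with a small enumeration. The text does not state how the demo enumerates hypotheses, so the convention used here (unordered model triples paired with ordered image triples) is an assumption, chosen because it matches the quoted numbers; the labels are hypothetical, with model feature i truly corresponding to image feature i.

```python
from itertools import combinations, permutations


def enumerate_hypotheses(sub_model, sub_image):
    """Yield every pairing of a model triple with an ordered image triple."""
    for model_triple in combinations(sub_model, 3):
        for image_triple in permutations(sub_image, 3):
            yield model_triple, image_triple


# Hypothetical labels: model feature i corresponds to image feature i.
sub_model = [0, 1, 2, 3]
sub_image = [0, 1, 2, 3]

hypotheses = list(enumerate_hypotheses(sub_model, sub_image))
num_correct = sum(1 for model_triple, image_triple in hypotheses
                  if model_triple == image_triple)
```

Here `len(hypotheses)` is 4 model triples times 24 ordered image triples, and exactly one image ordering per model triple is the correct correspondence.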

When the demo runs in exhaustive mode, the user is asked to specify a desired $P_F$.
The number of model and image features implies a single ROC curve, and the
user-specified $P_F$ implies a particular $P_D$ and threshold. The system reports to the user
the implied $P_D$ and proceeds to cycle through all size 3 hypotheses formed by
correspondences between the model and image subsets, showing the user all hypotheses
which score above the threshold. The user answers each query with “correct” or
“incorrect”, and the number of times the system makes a mistake is tallied.

When the demo runs in random mode, the user is asked to specify only a $P_F$, after
which 1000 randomly chosen hypotheses are tested, the assumption being that the
probability of randomly choosing a correct one is negligible.

The output of the demo is a histogram of the weights of all the hypotheses tested.
If the demo was in random mode, the normalized histogram should have the same
distribution as $W_{\bar{H}}$. If the demo was in exhaustive mode, then the predicted versus
empirical $(P_F, P_D)$ point is illustrated on a graph.

We  ran  the  demo  in  exhaustive  mode  on  the  telephone  image  shown  in  Figure  6-7  with 
a  user  specified  certainty  level  of  .99.  The  first  part  of  the  figure  shows  the  grey-scale 
image  of  the  telephone  with  the  points  chosen  to  comprise  the  model  indicated  in 
white.  The  bottom  figures  show  a  correct  and  incorrect  hypothesis  that  the  system 
came  up  with  that  exceeded  the  threshold. 

In terms of performance, the demo failed quite dramatically, showing far more incorrect
hypotheses exceeding the predicted threshold than should have been the case.
Upon inspection the cause for this breakdown is easily identified; running the demo
in random mode with the same model and image produced the histogram shown in
Figure 6-8. The figure shows the predicted density of $W_{\bar{H}}$ for m = 30, n = 248
superimposed on the normalized histogram of the actual density function. Both the mean
and variance are much larger than they should be. The question is, what is causing
such a large discrepancy?

We  pinpointed  the  problem  by  performing  the  following  sequence  of  experiments. 

(A) First we eliminated the possibility that the sheer number of model and image
points was the culprit by performing the same simulation as was done to derive
the ROC curves in Chapter 4 for the same values of m and n as in the telephone
model and image. The results of the simulation follow the predictions of
the error analysis very satisfactorily, and the normalized histogram of the weights,
approximating the density of $W_{\bar{H}}$, is shown in Figure 6-9. This indicates that
the problem lies elsewhere.

(B) To eliminate the possibility that something about the model itself was causing
the behavior (for instance, the model symmetry), we created an image in which
the model was present, but all the remaining image points were redistributed
uniformly over the image, maintaining the same values of m and n. We then ran
the demo in random mode, and found that the resulting histogram of weights
also conformed to the predictions of the error analysis (Figure 6-10). This
pinpoints the problem: the only difference between this experiment and
the original demo was the distribution of the clutter points.

(C) Lastly, we tested to make sure that model pose did not affect the results by
running the same test as (B), but translating the model points to the very
top of the image (Figure 6-11). This did affect the weight histogram slightly,
but in the other direction; that is, it served to make the $P_F$ prediction an
overestimate, not an underestimate, of the clutter effects.


6.3.2  Finding  a  Workaround 

Density  Correction  Factor 

Our first attempt at fixing the problem is to determine the effective image density.
If we can do this, then we can maintain a single ROC and threshold per image.
Let us define a quantity which we will name the density correction factor. This
is an empirically derived number which serves as a kind of amplification factor, in
that it scales the actual number of image features to yield the effective number of
image features. The procedure for finding it is simple: we subdivide the image into
a 16 × 16 grid of regions, each one with approximately uniformly distributed image
clutter. During the demo, every time an error disk is projected into the image, a
count for the region it falls in is incremented. Finally, the histogram is normalized (turning
it into a 2D probability density function for the probability of being hit by an error
disk), and from this we calculate the expected number of image points per region
by multiplying the normalized histogram by the number of image points per region.
This is then turned into the density correction factor by multiplying it by (number
of regions / number of image points).

Figure 6-7: The top figure shows the original image, with the feature points chosen for
the model indicated in white. The middle figure is a correct hypothesis that exceeded the
predicted threshold, and the bottom shows an incorrect one. The points indicate image
feature points, and the circles indicate projected weight disks.

Figure 6-8: The histogram for the weights of incorrect hypotheses chosen from the original
image. Note that the mean and variance greatly exceed those of the predicted density of
$W_{\bar{H}}$ for m = 30, n = 248.

Figure 6-9: The histogram of $W_{\bar{H}}$ for 2500 randomly chosen hypotheses for this model
and image, m = 30, n = 248, when the clutter is uniform and the model does not appear
in the image. The prediction closely matches the empirical curve.

Figure 6-10: The top picture shows the model present in the image, but with the clutter
points redistributed uniformly over the image. The actual density of $W_{\bar{H}}$ closely matches
the predicted density.

Figure 6-11: When the model points are displaced to the top of the image, the resulting
density of $W_{\bar{H}}$ is affected, but in the other direction. That is, the noise effects are now
slightly overestimated instead of underestimated.

We used the density correction factor to get an idea of the difference between the
number of clutter points that we are using in our calculation versus the effective
amount of clutter. We would expect the demo to produce a correction factor of
$\approx 1$ when run on a completely uniformly distributed image, with a higher number
indicating a higher variability in density. As expected, the original image resulted
in a correction factor of $\approx 4.57$, while experiment A (completely uniform image)
yielded a correction factor of $\approx 1.13$. Experiments B (model with uniform noise)
and C (displaced model with uniform noise) yielded intermediate values of $\approx 2.03$
and $\approx 1.75$, respectively. Note that knowing this number doesn’t directly suggest a
solution, since a correction factor of > 1 doesn’t imply that the method breaks down.
Rather, it simply confirms that regions of high clutter density are actually hit more
often than the low clutter regions, as we suspected.
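
The density correction factor procedure described above can be sketched as follows. This is one reading of the description, not the original code: the grid is flattened into a list of regions, and the names `density_correction_factor`, `hit_counts` and `point_counts` are hypothetical. The toy inputs use 4 regions instead of the 16 × 16 grid.

```python
def density_correction_factor(hit_counts, point_counts):
    """Compute the density correction factor from per-region tallies.

    hit_counts[r]   -- number of error disks that landed in region r
    point_counts[r] -- number of image points lying in region r
    """
    total_hits = sum(hit_counts)
    total_points = sum(point_counts)
    num_regions = len(point_counts)
    # Normalized hit histogram: probability that a disk lands in each region.
    hit_prob = [hits / total_hits for hits in hit_counts]
    # Expected number of image points in the region a disk lands in.
    expected_points = sum(p * n for p, n in zip(hit_prob, point_counts))
    # Scale by (number of regions / number of image points).
    return expected_points * num_regions / total_points


# Toy 4-region examples: uniform clutter, then clutter and hits concentrated together.
uniform = density_correction_factor([5, 5, 5, 5], [10, 10, 10, 10])
skewed = density_correction_factor([18, 2, 0, 0], [30, 10, 0, 0])
```

For uniform clutter the factor is 1, as the text expects; when both the points and the disk hits concentrate in the same regions, it rises above 1 (2.8 in the skewed toy case).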

Threshold  per  Hypothesis 

It  is  clear  from  our  work  up  until  this  point  that  the  uniform  clutter  assumption  does 
not  adequately  model  the  clutter  in  real  images,  and  we  cannot  fix  the  method  by 
amplifying  the  number  of  image  points  in  a  naive  way.  Instead,  we  have  modified 
the  method  to  work  on  a  per-hypothesis  basis.  That  is,  we  use  the  same  grid  of 
uniformly  distributed  density  regions  to  estimate  the  effective  image  density  every 
time  we  project  the  model  into  the  image,  and  calculate  the  ROC  curve  and  threshold, 
assuming  a  fixed  certainty  level.  This  implies  that  we  cannot  predict  the  overall 
probability  of  a  miss  at  the  outset  of  the  demo,  since  it  changes  for  every  hypothesis, 
but  we  can  set  the  threshold  to  maintain  a  fixed  probability  of  false  alarm. 
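
One plausible way to estimate the effective density for a single hypothesis is to average the point counts of the grid regions hit by that hypothesis's error disks and scale the local average up to the whole image. The estimator, the function names and the toy image below are assumptions made for illustration; the sketch also assumes all disk centers fall inside the image, and it stops short of the ROC computation, which depends on the analysis of earlier chapters.

```python
def region_index(x, y, width, height, grid=16):
    """Map an image location to its (row, col) cell in a grid x grid subdivision."""
    col = min(int(x * grid / width), grid - 1)
    row = min(int(y * grid / height), grid - 1)
    return row, col


def effective_num_points(disk_centers, point_counts, width, height, grid=16):
    """Estimate the effective number of image points seen by one hypothesis.

    Averages the point counts of the cells hit by the hypothesis's error disks
    and scales that local average up to the full grid (an assumed estimator).
    """
    cells = [region_index(x, y, width, height, grid) for x, y in disk_centers]
    mean_count = sum(point_counts[r][c] for r, c in cells) / len(cells)
    return mean_count * grid * grid


# Toy 2 x 2 grid: all 80 image points lie in the left half of a 100 x 100 image.
counts = [[40, 0],
          [40, 0]]
left = effective_num_points([(10, 10), (20, 80)], counts, 100, 100, grid=2)
right = effective_num_points([(90, 10)], counts, 100, 100, grid=2)
```

In this toy image a hypothesis projecting into the left half sees an effective count of 160, twice the actual 80 points, while one projecting into the right half sees none, so the two hypotheses would receive different thresholds.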


Implications  of  the  Uniform  Clutter  Assumption 

We have definitively shown that the assumption of uniformly distributed clutter
underestimates the negative effects of clutter for this recognition algorithm. Though
the number of images we have examined is not enormous, it is quite safe to say that
one cannot assume that the feature points in an image will be so distributed, and
so any analysis which depends on this assumption will underestimate the effects of
clutter. To our knowledge, all error analyses that have been done until now in the
field of computer vision have used this assumption.


6.4 Demo


We revised the demo to take into account the density correction per hypothesis. The
demo runs as before, with two changes. First, the user is able to specify either
a $P_F$ or $P_D$ level. If a $P_D$ level is specified, then only correct correspondences are
tested. These correct correspondences are passed to the demo through the parameters
SUB-MODEL and SUB-IMAGE. Second, the user is not told at the outset what the
implied $P_D$ rate (or $P_F$ rate) will be for the specified level, since it changes for every
hypothesis.

In this section we present the output of the demo in action. In theory the demo
should work for any planar object in an arbitrary pose, but in practice the fact that
we are working under perspective instead of orthographic projection will lead to errors
that we do not expect to be adequately modeled by a Gaussian. For this reason
we cannot vary the pose of the object in our experiment (except for translations in
the x and y directions), and so for this limited case we can include 3D models in our
domain.

For  the  demo,  then,  we  can  use  all  the  3D  objects  that  we  have  been  working  with 
until  now,  namely,  the  telephone,  fork,  and  army  knife.  We  work  with  a  single  object 
at  a  time.  A  model  of  an  object  is  constructed  from  a  single  image  of  it  by  first 
processing  the  image  to  find  all  the  feature  locations,  displaying  their  2D  locations, 
and  then  mousing  on  points  which  we  want  to  be  in  the  model.  To  test  the  validity 
of  the  error  analysis,  we  run  the  demo  on  a  different  image  than  the  one  from  which 
the  model  was  constructed. 


6.4.1  Telephone 

In this test, we used the telephone model that was shown in Figure 6-7. For every
hypothesis, the effective density is calculated, and the ROC curve for that model
size and image density determined. The threshold associated with the ROC point
$(P_F, P_D)$ on the curve is found, and if the weight exceeds the threshold, the hypothesis
is displayed.

Table 6.1 shows the results of experiments in which a particular rate of either false
alarm or true detection was given, then the threshold was dynamically set per hypothesis
to maintain the specified rate. The same three experiments were performed
for four different values of $\sigma_0$, including that found in Section 6.1.2. The first column
is the $\sigma_0$ value that was assumed for the experiment. The second and third columns
contain the user-specified $P_F$ or $P_D$. In the fourth column is the total number of
hypotheses tested. The fifth, sixth and seventh columns contain the expected number
of hypotheses of those tested that should pass the threshold, the actual number of
hypotheses that pass the threshold, and the error bar for the experiment (we show
one standard deviation $= \sqrt{t P_F (1 - P_F)}$, $t$ = number of trials). The actual $P_F$ (or
$P_D$) is shown in the eighth column. The last column shows the average distance of all
the hypotheses that passed the threshold from $E[W_H]$. The distance is given in terms
of the standard deviation of $W_H$, that is:

$$D = \frac{E[W_H] - w}{\sqrt{\mathrm{Var}(W_H)}}$$

where $w$ is the weight of the hypothesis which crossed the threshold.

$\sigma_0$   $P_F$   $P_D$   Total   Expect   Actual   Error   Act $P_F$/$P_D$   E(D)
0.5          .01             1081    11       15       3.27    .014
             .001            1121    1        0        1.06    0
                     .9      512     461      503      6.8     .98
0.65         .01             1082             29       3.27    .027              -3.7
             .001            1080             0        1.04    0
                     .9      510              510      6.8     1.0               2.7
1.0          .01                     11       23       3.26    .021              -3.24
             .001                    1        0        1.05    0
                     .9              463      501      6.8     .97               5.5
2.0          .01             1113    11       21       3.32    .019              -1.95
             .001            1091    1        0        1.05    0
                     .9      508     457      496      6.8     .98               4.2

Table 6.1: Results of experiments for the telephone. The first column is the $\sigma_0$ value that
was assumed for the experiment. The second and third columns contain the user-specified
$P_F$ or $P_D$. In the fourth column is the total number of hypotheses tested. The fifth, sixth
and seventh columns contain the expected number of hypotheses of those tested that should
pass the threshold, the actual number of hypotheses that pass the threshold, and the error
bar for the experiment (we show one standard deviation $= \sqrt{t P_F (1 - P_F)}$, $t$ = number of
trials). The actual $P_F$ (or $P_D$) is shown in the eighth column. The last column shows the
average distance of all the hypotheses that passed the threshold from $E[W_H]$.
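
The "Expect" and "Error" columns of Table 6.1 follow directly from the number of trials and the specified probability, treating each tested hypothesis as an independent Bernoulli trial. A minimal sketch, with the hypothetical helper name `expected_and_error`, checked against the first row of the table (1081 trials at a specified rate of .01):

```python
from math import sqrt


def expected_and_error(trials, p):
    """Expected number of passing hypotheses and one binomial standard deviation."""
    expected = trials * p
    error = sqrt(trials * p * (1.0 - p))
    return expected, error


# First row of Table 6.1: 1081 hypotheses tested at a specified rate of .01.
expected, error = expected_and_error(1081, 0.01)
```

Rounded, these give the tabulated values of 11 expected hypotheses and an error bar of 3.27. The same function applied to 512 trials at a rate of .9 gives 461 and 6.8, matching the third row.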

This model and image contained 33 and 231 features, respectively. In our experiments
we tested several values of $\sigma_0$ to see how varying that value would affect the accuracy
of our predictions. In Section 6.1.2 when we measured the noise associated with
the maximum curvature feature type for the phone image group, we determined that
$\sigma_0 = 0.65$. As we can see from the table, the results were not significantly different for
the different values of $\sigma_0$, though a value of $\sigma_0 = 0.5$ seemed to be most accurate for
the experiment $P_F = 0.01$. Oddly enough, using the measured value for $\sigma_0$ resulted
in the worst predictions.

For all of the experiments, we see that the threshold predicted to maintain a specified
$P_F$ of .01 did not achieve the desired false detection rate. The reason for this is that
our assumption that the density function of $W_{\bar{H}}$ is Gaussian is false; in fact, the upper
tail of the actual distribution of $W_{\bar{H}}$ contains more of the distribution than a Gaussian
with the same mean and variance would. Despite this, the predicted thresholds for
an even lower probability of false alarm (i.e., $P_F = 0.001$) work well.

Figure 6-12: The incorrect hypotheses that fell above the threshold chosen to maintain a
$P_F$ of 0.01. For these experiments, $\sigma_0 = 2.0$. The circles show the locations of the projected
weight disks, while the points show the feature locations.

$\sigma_0$   $P_F$   $P_D$   Total   Expect   Actual   Error   Act $P_F$/$P_D$   E(D)
1.0          .01             1157    12       44       3.38    .038              -1.6
             .001            1139    1        0        1.02    0
                     .9      514     463      0        6.8     0
1.8          .01             1075    11       51       3.26    .047              -.98
             .001            1017    1        1        1.01    .001              -2.77
                     .9      512     461      318      6.8     .62               -.048
2.0          .01             1126    11       40       3.34    .036              -.93
             .001            1100    1        0        1.05    0
                     .9      541     487      423      7.0     .78               -2
3.0          .01             1020    10       29       3.18    .028              -.44
             .001            1044    10       2        1.0     .002              -1.06
                     .9      517     465      431      6.8     .83               .5

Table 6.2: Experimental results for the army knife. Though $\sigma_0$ for this image group was
determined to be 1.8, we see that the predictions for a value of $\sigma_0 = 2$ or 3 are much better.
The columns indicate the $\sigma_0$ used for the experiment, either $P_F$ or $P_D$, the total number
of hypotheses tested, the expected number of hypotheses to score above the threshold, the
actual number that scored above the threshold, and the error bar for this value (we show
one standard deviation $= \sqrt{t P_F (1 - P_F)}$, $t$ = number of trials). The actual $P_F$ (or $P_D$)
is shown in the next column, and the last column shows the average distance of all the
hypotheses that passed the threshold from $E[W_H]$.

Lastly, we note that on the average, even those false hypotheses which passed the
threshold had weights which were still significantly below the mean of $W_H$, while the
weights of true hypotheses passing the threshold were significantly above. Though not
justified by the analysis, this information might also be used to discriminate between
true and false hypotheses passing the threshold.


6.4.2  Army  Knife 

The same experiment was done with the army knife. This example differs from the
previous example in that we used far fewer model points, 14 versus 33. This brings
the value of $E[W_H]$ much closer to $E[W_{\bar{H}}]$, and in general we found that the system
behaved less well due to this. For this experiment, the numbers of model and image
features were 14 and 162 respectively. The model plus two examples of hypotheses
which fell above the threshold are shown in Figure 6-13.

The $\sigma_0$ for this feature type was determined in Section 6.1.2 to equal 1.8. Referring
to Table 6.2, we see that using this value for the sensor noise results in a very poor
prediction for $P_D$. For example, for a specified $P_D$ value of 0.9 we can see that
the actual fraction of true hypotheses that passed the threshold was only 0.62.
Raising the value of $\sigma_0$ improved the prediction for both $P_F$ and $P_D$, with the best
performance prediction resulting from using a value of $\sigma_0 = 3.0$, with $\sigma_0 = 2.0$
a close second. However, even the predictions for these $\sigma_0$ values result in too
high a false alarm rate when a value of $P_F = 0.01$ is specified; again, using a
Gaussian approximation for the density of $W_{\bar{H}}$ causes us to underestimate the extent
of the upper tail of the actual distribution. As in the previous model, the predicted
performance closely matched actual performance for a $P_F$ value of 0.001.

Unlike in the previous set of experiments with the telephone, there is not much
difference between the average distance from $E[W_H]$ of true and false hypotheses which
pass the threshold. Whereas before there was a chance that this extra information
might further help discriminate between true and false hypotheses which pass the
threshold, for this model and image the extra information is no help.

Figure 6-13: The top figure shows the original image, with the feature points chosen for
the model indicated in white. The middle figure is a correct hypothesis that exceeded the
predicted threshold, and the bottom shows an incorrect one.

Figure 6-14: The model of the fork superimposed in white onto the original image.

6.4.3  Fork 

The same group of experiments for the fork image group are depicted in Table 6.3.
The model contained 0 feature points and is shown in Figure 6-14, while the image
contained 170. The table indicates that the predictions for $\sigma_0 = 1.7$ and 2.0
give the best results of the group, though the prediction for $\sigma_0 = 1.7$ is worse for a
specified $P_F = 0.01$, and the prediction for $\sigma_0 = 2.0$ is worse for $P_F = 0.001$. The
former value, $\sigma_0 = 1.7$, was the value determined for this image group in Section
6.1.2. Generally, performance predictions were not quite as successful for this model
as for the first two.



Figure 6-15: Three kinds of false positives that occurred for a highly symmetric model.
$\sigma_0$   $P_F$ / $P_D$   Total   Expect   Actual   Error   Act $P_F$/$P_D$   E(D)

1.0          .01             1093    11       36       3.28    .033              -.058
             .001            1070    1        1        1.03    .001              -1.84
             .9              508     457      468      6.8     .92               2.48

1.7          .01             1052    11       56       3.23    .053              .27
             .001            1065    1        1        1.03    .001              1.77
             .9              529     476      475      6.9     .90               2.55

2.0          .01             1011    11       43       3.16    .043              -.28
             .001            1020    1        7        1.01    0.007             1.82
             .9              523     471      467      6.9     .89               2.4

3.0          .01             1043    10       59       3.21    .056              .67
             .001            1057    10       17       1.03    .016              1.5
             .9              512     461      444      6.8     .87               1.6

Table  6.3:  Experimental  results  for  the  fork  model. 

6.4.4  The  Effect  of  Model  Symmetry 

While performing the previous experiments, we noted that incorrect hypotheses
which roughly aligned the model along an axis of symmetry in its image projection
were more likely to get a high score and pass the threshold. The second hypothesis
in Figure 6-12 is an example of this phenomenon, as are the first two false
hypotheses shown in Figure 6-15. These “symmetric” hypotheses are more likely to
be sampled when the three image points in the basis actually arise from the model
(while not correctly corresponding to the model basis tested). To test this effect, we
reran the above experiments for some sample values of $\sigma_0$, with the three points in the
image basis used for the random correspondence restricted to those arising from
the model.

The  results  in  Table  6.4  show  that  on  the  whole,  the  restriction  to  hypotheses  using 
image  bases  arising  from  the  model  causes  a  higher  false  positive  rate  than  the  same 
experiment  without  the  restriction.  This  does  not  prove  that  the  model  symmetry  is 
entirely  causing  this  effect,  especially  since  the  knife  model,  which  is  not  symmetric, 
shows  the  same  tendency  towards  a  higher  false  positive  rate,  while  the  fork,  which  is 
highly  symmetric,  does  not  (at  least  for  the  Pf  =  .01  experiment).  Nonetheless,  we 
suspect that model symmetry may be a contributing factor, though more tests would
have  to  be  done  to  settle  the  matter  conclusively. 


6.4.5  Comparison  to  Results  Using  the  Uniform  Clutter 
Assumption 

In  the  previous  sections  we  performed  experiments  in  which  we  dynamically  set  the 
threshold  per  hypothesis,  depending  on  which  region  of  the  image  the  model  projected 
Model   $\sigma_0$   $P_F$   Total   # Found   This $P_F$   Previous $P_F$   Error

Phone   1.0          .01     1101    37        [?]          [?]              [?]
                     .001    1126    2         [?]          [?]              [?]

Knife   3.0          .01     [?]     [?]       [?]          [?]              3.29
                     .001    [?]     [?]       [?]          [?]              1.01

Fork    2.0          .01     [?]     46        .043         [?]              3.25
                     .001    [?]     22        .021         [?]              1.03

Table 6.4: Experimental results when all the points in the image bases tested come from
the model. The first and second columns contain the model and $\sigma_0$ tested. The next columns
contain the specified $P_F$, total number of hypotheses tested, and number of hypotheses that
passed the threshold. The next two columns contain the actual $P_F$ for this experiment and
the value for the same experiment in which the tested image bases are not constrained to
come from the model (this value was taken from the previous group of experiments). Finally,
the last column is the error bar for the experiment, which we took to be one standard
deviation $= \sqrt{t P_F (1 - P_F)}$, $t$ = number of trials.
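
The error bar used in these tables is the standard deviation of a binomial count, so it can be recomputed from the number of trials and the specified probability. The following is a minimal sketch; the function names are illustrative and not part of the original experiments, and the sample figures are the $\sigma_0 = 1.7$, $P_F = .01$ row of Table 6.3.

```python
import math


def error_bar(trials: int, p: float) -> float:
    """One standard deviation of a Binomial(trials, p) count: sqrt(t * p * (1 - p))."""
    return math.sqrt(trials * p * (1.0 - p))


def deviations_from_expected(trials: int, p: float, observed: int) -> float:
    """How many error bars the observed count lies above the expected count t * p."""
    return (observed - trials * p) / error_bar(trials, p)


if __name__ == "__main__":
    # Fork model, sigma_0 = 1.7, specified P_F = .01: 1052 hypotheses, 56 passed.
    print(round(error_bar(1052, 0.01), 2))  # 3.23, the Error entry for that row
    print(round(deviations_from_expected(1052, 0.01, 56), 1))
```

Expressing the excess in units of the error bar makes it easier to see which rows differ from the prediction by more than sampling noise alone would explain.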


Model   $\sigma_0$   $P_F$   Total   # Found   This $P_F$   Previous $P_F$

Phone   1.0          .01     1080    418       .39          .021
Knife   3.0          .01     1114    528       .47          .028
Fork    2.0          .01     1061    490       .46          .043

Table 6.5: Experimental results when uniform clutter is assumed. The first and second
columns contain the model and $\sigma_0$ tested. The next columns contain the specified $P_F$,
total number of hypotheses tested, and number of hypotheses that passed the threshold.
The next two columns contain the actual $P_F$ for this experiment and the value for the same
experiment in which the effective density is calculated per hypothesis, and the threshold
dynamically reset.

to  under  the  tested  hypothesis.  We  have  shown  the  method  working  reasonably  well 
despite a slightly higher false positive rate than expected for some cases. One source
of the problem may be that the image was indiscriminately broken into a
16 × 16 grid for the effective density calculation, in which the clutter was assumed to
be uniformly distributed within each rectangle of the grid. This approximation may not
be quite correct for the images used.
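
The piecewise-uniform assumption can be illustrated with a short sketch. This is not the effective density calculation used in the experiments, whose details are given elsewhere in the thesis; it only counts feature points per grid cell and divides by the cell area. The function names, the image size, and the sample points are hypothetical.

```python
def grid_densities(points, width, height, cells=16):
    """Return a cells x cells list of clutter densities (points per unit area).

    Clutter is treated as uniform within each grid rectangle.
    """
    cell_w = width / cells
    cell_h = height / cells
    counts = [[0] * cells for _ in range(cells)]
    for x, y in points:
        col = min(int(x / cell_w), cells - 1)  # clamp points on the far edge
        row = min(int(y / cell_h), cells - 1)
        counts[row][col] += 1
    cell_area = cell_w * cell_h
    return [[c / cell_area for c in row] for row in counts]


def uniform_density(points, width, height):
    """Single image-wide density, i.e. the uniform clutter assumption."""
    return len(points) / (width * height)


if __name__ == "__main__":
    # Hypothetical 160 x 160 image: a tight cluster of features plus two strays.
    cluster = [(5.0 + 0.5 * i, 5.0 + 0.3 * i) for i in range(8)]
    points = cluster + [(100.0, 40.0), (150.0, 150.0)]
    densities = grid_densities(points, 160, 160)
    peak = max(max(row) for row in densities)
    print(peak / uniform_density(points, 160, 160))
```

When features cluster, the densest cell can exceed the image-wide average by a large factor, which is the situation in which a single global threshold is too lenient.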

Lest  the  reader  question  the  advantage  of  using  a  dynamic  threshold,  we  illustrate 
some experiments for the case when clutter is assumed to be uniform. That is, a
single  ROC  curve  and  threshold  is  calculated  for  the  entire  image,  and  any  hypothesis 
which  falls  above  it  is  counted.  The  results  are  shown  in  Table  6.5.  The  table  clearly 
shows  the  necessity  of  using  dynamic  thresholds,  and  one  can  appreciate  how  well  our 
method  actually  performs  when  compared  with  these  results. 
6.5  Conclusion


In  this  chapter  we  have  demonstrated  the  applicability  of  many  of  the  ideas  developed 
so far. First we argued that the positional error of features due to effects such as
lighting and smoothing is well modeled by a Gaussian approximation, and showed
how to determine the size of the Gaussian for different feature types. Finally we
showed an example of automatic threshold setting applied to the problem of finding a
correct correspondence between model and image features. It was found that in two
of the three image groups, the better predictions were achieved when using a $\sigma_0$ for
each image group that was slightly higher than that found in our measurements.

It  was  demonstrated  early  on  that  the  assumption  that  clutter  is  uniformly  distributed 
over  an  image  greatly  underestimates  the  effects  of  clutter  on  the  algorithm  we  are 
using  when  applied  to  a  real  recognition  problem.  Because  of  this,  we  were  not  able 
to apply exactly the same approach that we demonstrated on simulated images in
Chapter 4; rather, we had to adjust the threshold for every model pose hypothesis,
depending  on  the  clutter  levels  of  the  regions  that  the  model  projected  to.  This 
meant  that  when  a  hypothesis  projected  to  a  region  of  high  density,  it  needed  far 
more  evidence  to  be  considered  a  possible  detection  than  otherwise.  We  showed  this 
approach  working  reasonably  well  for  a  small  group  of  images.  When  given  false  alarm 
rates  of  .01  and  .001,  the  system  was  able  to  recalculate  the  threshold  per  hypothesis 
to  achieve  close  to  the  specified  performance. 

One  effect  that  we  noted  was  that  when  the  image  basis  used  in  the  correspondence 
was  constrained  to  come  from  the  model  points  in  the  image,  the  false  positive  rate 
tended  to  be  higher.  We  suspect  this  may  be  related  to  the  effect  of  model  symmetry, 
since  incorrect  correspondences  that  happened  to  project  the  model  to  a  position  that 
was  relatively  symmetric  to  the  actual  pose  would  often  pass  the  threshold.  This  event 
is  more  likely  to  occur  when  the  features  in  the  image  basis  come  from  the  model. 

In the next chapter we will discuss the implications of our findings for other existing
recognition  techniques. 
Chapter 7

Implications  for  Recognition 
Algorithms 


The  error  analysis  we  have  presented  applies  not  only  to  alignment,  but  to  geometric 
hashing  as  well.  We  will  briefly  discuss  the  original  geometric  hashing  algorithm  and 
explain  the  modifications  that  are  required  to  be  able  to  apply  our  error  analysis. 
Finally  we  will  discuss  possible  applications  and  extensions  of  our  work. 

Because we used affine coordinates as the model representation and limited the Gaussian
weight disk to a radius of $2\sigma_e$, our method for threshold and performance prediction
applies equally well to both alignment and geometric hashing, provided the
original geometric hashing algorithm is modified to take error into account in a particular
way. First we will discuss the original algorithm, and then we will describe
the modifications required for the error analysis to apply.


7.1  Geometric  Hashing 

The  geometric  hashing  method  was  introduced  by  Lamdan,  Schwartz  and  Wolfson 
in  [LSW87],  and  Hummel  and  Wolfson  in  [HW88].  The  algorithm  consists  of  two 
stages, a preprocessing stage in which a lookup table is created, and a run time
stage in which small groups of image features are used to access the lookup table for
potential matches.

In the preprocessing stage, the hash table is constructed as follows: Every ordered
triple of model points is used as a basis, and the affine coordinates $(\alpha, \beta)$ of all other
model points are computed with respect to each basis. Thus, if $m_0$, $m_1$ and $m_2$ are
basis points, then we represent any other feature point by

$m_i = m_0 + \alpha_i (m_1 - m_0) + \beta_i (m_2 - m_0)$

The basis $(m_0, m_1, m_2)$ is entered into the hash table at each $(\alpha_i, \beta_i)$ location. Intuitively,
the invariance of the affine coordinates of a model with respect to 3 of its own
points as basis is being used to “precompute” all possible views of the model in an
image. The precise algorithm is:

•  for every ordered model triplet $B_k = (m_0, m_1, m_2)$

   -  for every other model point $m_j$

      (i)  find coordinates $m_j = (\alpha_j, \beta_j)$ with respect to basis $B_k$

      (ii)  enter basis $B_k$ at location $(\alpha_j, \beta_j)$ in the hash table.

The running time for this stage is $O(m^4)$, where $m$ = number of model points.
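
The invariance that the preprocessing stage relies on can be checked numerically. The sketch below solves the basis equation for the affine coordinates of a point and confirms that they are unchanged when the whole model undergoes an affine map. The model points and the map are arbitrary illustrative values, not data from the experiments.

```python
def affine_coords(p, b0, b1, b2):
    """Solve p = b0 + alpha*(b1 - b0) + beta*(b2 - b0) for (alpha, beta).

    Returns None when the three basis points are collinear.
    """
    ux, uy = b1[0] - b0[0], b1[1] - b0[1]
    vx, vy = b2[0] - b0[0], b2[1] - b0[1]
    px, py = p[0] - b0[0], p[1] - b0[1]
    det = ux * vy - uy * vx
    if det == 0:
        return None
    return ((px * vy - py * vx) / det, (ux * py - uy * px) / det)


def affine_map(p):
    """A hypothetical view change: linear part [[2, 1], [-1, 3]] plus a shift."""
    x, y = p
    return (2 * x + y + 10, -x + 3 * y + 5)


if __name__ == "__main__":
    model = [(0, 0), (4, 0), (0, 3), (5, 4), (2, 6)]
    image = [affine_map(p) for p in model]
    for j in (3, 4):
        # The two pairs printed on each line should agree.
        print(affine_coords(model[j], *model[:3]), affine_coords(image[j], *image[:3]))
```

Because the coordinates depend only on the point and its basis, they can be stored once per ordered basis and looked up for any view of the planar model.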

At  recognition  time,  the  image  is  processed  to  extract  2D  feature  points.  Every  image 
triple  is  then  taken  as  a  basis,  and  the  affine  coordinates  of  all  other  image  points  are 
computed with respect to the basis to index into the hash table and “vote” for all
bases  found  there.  We  will  use  the  term  “random  image  basis”  to  refer  to  an  image 
basis  which  contains  at  least  one  point  not  arising  from  the  model.  Intuitively  we 
are  searching  for  any  three  image  points  which  come  from  the  model,  and  using  the 
hash  table  to  verify  hypothesized  triples  of  image  points  as  instances  of  model  points. 
Such  an  image  triple  will  yield  a  large  number  of  votes  for  its  corresponding  model 
basis.  The  precise  algorithm  is: 

•  for every unordered image triplet $(i_0, i_1, i_2)$

   (a)  for every other image point $i_j$

        (i)  find coordinates $i_j = (\alpha_j, \beta_j)$ with respect to basis $(i_0, i_1, i_2)$

        (ii)  index into the hash table at location $(\alpha_j, \beta_j)$ and increment a
              histogram count for all bases found there.

   (b)  If the weight of the vote for any basis $B_k$ is greater than some threshold
        $\theta$, stop and output the correspondence between triple $(i_0, i_1, i_2)$ and basis
        $B_k$ as a correct hypothesis.

A single pass of the algorithm corresponds to testing a single image basis for a correspondence
to any model basis. In some versions of the algorithm, the hypothesis that
is output subsequently undergoes a verification stage before being accepted as correct.
The termination condition for accepting a correspondence of bases (and hence
a pose of the object) and the implied probability of true detection and false alarm
are exactly the issues that our error analysis addresses.
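
The two stages above can be sketched compactly. This is an illustration of the original, error-free scheme only: the quantization of $(\alpha, \beta)$ into square bins, the bin size, and all function names are assumptions made for the example, and no error regions or weights are modelled.

```python
from collections import defaultdict
from itertools import combinations, permutations


def affine_coords(p, b0, b1, b2):
    """Affine coordinates of p in basis (b0, b1, b2); None if the basis is collinear."""
    ux, uy = b1[0] - b0[0], b1[1] - b0[1]
    vx, vy = b2[0] - b0[0], b2[1] - b0[1]
    px, py = p[0] - b0[0], p[1] - b0[1]
    det = ux * vy - uy * vx
    if det == 0:
        return None
    return ((px * vy - py * vx) / det, (ux * py - uy * px) / det)


def bin_of(coords, bin_size):
    """Quantize (alpha, beta) to a hash key. The bin size is an assumption."""
    return (round(coords[0] / bin_size), round(coords[1] / bin_size))


def build_table(model, bin_size=0.1):
    """Preprocessing: enter every ordered basis at the bin of every other point."""
    table = defaultdict(list)
    for basis in permutations(range(len(model)), 3):
        b0, b1, b2 = (model[i] for i in basis)
        for j, p in enumerate(model):
            if j in basis:
                continue
            coords = affine_coords(p, b0, b1, b2)
            if coords is None:
                break  # collinear basis: skip it entirely
            table[bin_of(coords, bin_size)].append(basis)
    return table


def vote(image, triple, table, bin_size=0.1):
    """Steps (a)(i)-(ii): histogram of votes per model basis for one image basis."""
    i0, i1, i2 = (image[i] for i in triple)
    votes = defaultdict(int)
    for j, p in enumerate(image):
        if j in triple:
            continue
        coords = affine_coords(p, i0, i1, i2)
        if coords is None:
            break
        for basis in table.get(bin_of(coords, bin_size), ()):
            votes[basis] += 1
    return votes


def recognize(image, table, theta, bin_size=0.1):
    """Step (b): return the first (image triple, model basis, votes) above theta."""
    for triple in combinations(range(len(image)), 3):
        for basis, count in vote(image, triple, table, bin_size).items():
            if count > theta:
                return triple, basis, count
    return None


if __name__ == "__main__":
    model = [(0, 0), (4, 0), (0, 3), (5, 4), (2, 6)]
    # Hypothetical image: an affine view of the model plus two clutter points.
    image = [(2 * x + y + 10, -x + 3 * y + 5) for x, y in model] + [(7, 7), (30, 2)]
    print(recognize(image, build_table(model), theta=1))
```

The sketch stops at the first basis whose vote exceeds `theta`, so the choice of `theta` alone decides what is reported; that is the termination condition whose false alarm and detection probabilities the error analysis is meant to predict.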


7.2  Comparison  of  Error  Analyses 

The first error analysis of the geometric hashing technique was done by Grimson,
Huttenlocher and Jacobs [GHJ91]. They used a uniform model for sensor error, and
concluded several things: first, that when sensor error is taken into account and a
particular image triplet is chosen in the recognition stage, then the regions in the hash
table that are consistent with the sensed position of any fourth image point (step (ii))
are ellipses whose center and axes are dependent on the configuration of the image basis.
Thus, the error regions themselves cannot be taken into account at the preprocessing
stage, but rather must be computed in step (ii) of the recognition stage, and all model
bases in the region incremented.

Second, they derived the probability that a single random image basis would match
any model basis as follows: Suppose the probability that a single random image point
will be in a region consistent with any model point is $\mu$ on average. Let us fix the
model basis that we are interested in. The probability that a single image point will
fall in any region consistent with this particular model basis is

$p = 1 - (1 - \mu)^m$


since there are $m$ places in the index table where this basis appears, and the image
point must avoid all of them to miss it. However, there are $n$ image points, so the probability
that this particular model basis gets at least $h$ votes is

$w_h = 1 - \sum_{i=0}^{h-1} \binom{n}{i} p^i (1 - p)^{n-i}$


This is the probability that a single random image basis matches a fixed model basis.
There are $m(m - 1)(m - 2)$ bases in the hash table (we will use $m^{(3)}$ to denote this
expression), so the probability that this image basis will contribute at least $h$ votes to
any model basis is

$1 - P\{\text{image basis contributes} \ge h \text{ to no model basis}\}$
$\quad = 1 - (1 - w_h)^{m^{(3)}}$
$\quad = 1 - \left[ \sum_{i=0}^{h-1} \binom{n}{i} p^i (1 - p)^{n-i} \right]^{m^{(3)}}$


This is the probability that in a single pass through the recognition stage of the
geometric hashing algorithm, the image basis being tested will find a match of at
least size $h$ at random. They presented an analogous analysis for alignment, which
is identical except the roles of $n$ and $m$ are switched. Thus they concluded that the
probability of an overall false positive was greater for the geometric hashing case than
for alignment, because $n > m$ prevails rather generally.
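
The three expressions in this derivation are easy to evaluate numerically. The sketch below follows them as stated in the text; the function names and the sample values of $\mu$, $m$, $n$ and $h$ are hypothetical and chosen only to show how quickly the probability of a random match grows with the number of image points.

```python
from math import comb


def p_point_hits_basis(mu: float, m: int) -> float:
    """p = 1 - (1 - mu)^m: one image point lands in some region of a fixed basis."""
    return 1.0 - (1.0 - mu) ** m


def p_at_least(h: int, n: int, p: float) -> float:
    """w_h = 1 - sum_{i=0}^{h-1} C(n, i) p^i (1 - p)^(n - i)."""
    return 1.0 - sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(h))


def p_random_match(mu: float, m: int, n: int, h: int) -> float:
    """1 - (1 - w_h)^(m(m-1)(m-2)): some model basis collects at least h votes."""
    w_h = p_at_least(h, n, p_point_hits_basis(mu, m))
    return 1.0 - (1.0 - w_h) ** (m * (m - 1) * (m - 2))


if __name__ == "__main__":
    for n in (50, 100, 200):
        print(n, p_random_match(mu=0.001, m=10, n=n, h=5))
```

Swapping the arguments `m` and `n` in `p_at_least` and in the exponent gives the analogous quantity for alignment described in the text, which is how the two methods were compared in [GHJ91].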

The difference in the positions of $n$ and $m$ in their analysis was based on the assumption
that alignment counts at most one image point per model disk whereas geometric
hashing counts all image points that appear in the model disk. This is equivalent to
the distinction between Schemes 1 and 3 that was discussed in Chapter 5. However,
the geometric hashing scheme can be easily modified to use either collection method
by keeping track of whether a point has already been collected from that particular
location; in fact, this is the method generally used. The alignment method can also
easily use either collection scheme. Therefore, the probability of error is equivalent
for either method when using comparable collection schemes.

Interestingly enough, we have shown the opposite of the conclusion of [GHJ91] with
regard to the probability of error as a function of collection method. We concluded
in Chapter 5 that the performance of Scheme 1 (counting all image points that fall in
a weight disk) is better than that of Scheme 3 (at most one image point per weight
disk). This is because, regardless of the clutter level, the expected number of points
falling inside a correctly hypothesized weight disk is always greater than the expected
number of points falling inside a random weight disk.
Therefore, the expected value of the sum of points appearing in the disk will always
be higher if the disk is correct. The other collection scheme saturates with noise once
there is a high probability that at least one image point will appear per weight disk.
This finding does not contradict the analysis in [GHJ91], since the weighting scheme
used in that analysis was based on a uniform model for sensor noise.
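
The saturation argument can be illustrated with a deliberately simplified counting model. This is not the Chapter 5 analysis: it assumes the number of clutter points in a disk is Poisson with mean `lam`, that a correct disk also contains its true model point with probability `q`, and it counts points rather than Gaussian weights. Both parameter names and values are hypothetical.

```python
import math


def expected_counts(lam: float, q: float):
    """Scheme 1 (count every point): expected points in a correct vs. random disk."""
    return lam + q, lam


def occupancy_probs(lam: float, q: float):
    """Scheme 3 (at most one point per disk): P(disk scores) for correct vs. random."""
    empty = math.exp(-lam)  # Poisson probability of no clutter point in the disk
    return 1.0 - (1.0 - q) * empty, 1.0 - empty


if __name__ == "__main__":
    q = 0.9  # hypothetical chance that the true model point lands in its disk
    for lam in (0.01, 0.1, 1.0, 5.0):
        c1, r1 = expected_counts(lam, q)
        c3, r3 = occupancy_probs(lam, q)
        print(f"lambda={lam}: scheme 1 gap={c1 - r1:.3f}, scheme 3 gap={c3 - r3:.3f}")
```

Under these assumptions the gap between correct and random disks stays at `q` for Scheme 1 but shrinks as `q * exp(-lam)` for Scheme 3. The sketch compares expected values only and says nothing about the variances, which also affect the achievable error rates.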

To apply our error analysis we would have to project entire error ellipses into the
hash table as described in [GHJ91], but in the Gaussian error model case, the ellipses
would be smaller and we would increment weights for model bases instead of votes.
Now we can appreciate why it was important to limit the Gaussian weight disk to a
finite size. If the distribution were unbounded, we would have to go through the entire
table and contribute some small weight to every basis, thus changing the run time
of the original geometric hashing algorithm. Applying our method to this domain
lets us derive triples of $(\theta, P_F, P_D)$ for the termination step of the
geometric hashing algorithm (step (b)).
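
The effect of bounding the weight disk can be shown with a small sketch. The function below is an assumption made for illustration: it uses an unnormalized Gaussian of the offset from the disk centre and sets the weight to zero beyond `cutoff` standard deviations, with the default of 2 taken from the $2\sigma_e$ radius mentioned earlier in this chapter. The thesis's own weight function and normalization are not reproduced here.

```python
import math


def truncated_gaussian_weight(dx: float, dy: float, sigma_e: float, cutoff: float = 2.0) -> float:
    """Unnormalized Gaussian weight of an offset (dx, dy), zero beyond cutoff * sigma_e."""
    d2 = dx * dx + dy * dy
    if d2 > (cutoff * sigma_e) ** 2:
        return 0.0
    return math.exp(-d2 / (2.0 * sigma_e**2))


if __name__ == "__main__":
    for dx in (0.0, 1.5, 3.0, 3.1):
        print(dx, truncated_gaussian_weight(dx, 0.0, sigma_e=1.5))
```

Because every offset outside the disk contributes exactly zero, only the hash table entries that intersect the projected disk need to be visited for each image point.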

Furthermore,  we  can  easily  calculate  the  probability  that  a  particular  image  basis 
will  match  any  model  basis,  as  was  done  in  [GHJ91].  We  already  mentioned  that 
the geometric hashing technique can be considered a “filtering” step which provides
candidate  model  to  image  basis  correspondences  to  some  more  expensive  verification 
step.  Then  the  technique  would  be  considered  to  break  down  once  the  number  of 
matches  it  offers  up  is  too  high. 

Suppose we are willing to verify (by alignment or any other verification technique) all
bases that pass our threshold, as long as there are $\le k$ of them. Then, an overall false
positive is the combined event that the three image points being tested do not arise
from the model, yet more than $k$ model bases “look good”. An overall true positive
is the combined event that the three image points do arise from the model, that $\le k$
model bases pass the test, and that one of them is the correct one. We will call
these combined events $\Omega_F$ and $\Omega_D$, and

$P\{\Omega_F\} = 1 - \sum_{i=0}^{k} \binom{m^{(3)}}{i} P_F^i (1 - P_F)^{m^{(3)} - i}$

$P\{\Omega_D\} =$
7.3  Summary


The  error  analysis  and  threshold  prediction  method  that  we  derived  in  this  thesis 
are  directly  translatable  to  the  geometric  hashing  algorithm,  provided  the  latter  is 
modified  to  take  Gaussian  error  into  account. 
Chapter 8


Conclusion 


The kind of analysis we have done in this thesis is crucial in order to build robust
systems. One way of making a system more robust is to incorporate several different
sensors or processing modules for the same feature. Each sensor or module has a
weighting attached to its output which is related to its reliability. The process of
integrating the information from all the sensors depends on correctly assessing
how reliable each sensor is, and under what conditions.

What  we  have  done  in  this  thesis  is  to  begin  to  assess  the  reliability  of  a  vision 
sensor  for  a  fixed  algorithm.  We  expect  that  such  error  analyses  will  be  necessary  for 
integrating  vision  into  any  automated  system,  with  or  without  multiple  sensors. 


8.1  Extensions 

To  conclude,  we  have  demonstrated  an  error  analysis  for  alignment  or  geometric 
hashing, and a companion method that predicts triples of (threshold, $P_F$, $P_D$) for a
fixed  number  of  model  and  image  features.  The  method  as  presented  is  limited  to 
planar  models  solely  due  to  our  ignorance,  as  yet,  of  an  analytical  expression  for  the 
size  of  a  projected  error  disk  as  a  function  of  sensor  error  in  the  3D  case.  We  showed 
the  method  working  well  first  in  the  domain  of  simulated  models  and  images,  and 
subsequently  in  real  images. 

The application to real images was problematic in that our model for clutter was
not accurate, and the disparity initially resulted in an unexpectedly high false alarm
probability. We modified the basic method to take the non-uniformity of the clutter
distribution into account, and subsequently demonstrated a method to dynamically
reset the threshold used for accepting a hypothesis to maintain a fixed probability of
detection or false alarm.

There  are  several  areas  for  extending  the  initial  work  presented  in  this  thesis: 

•  The most significant improvement would result from a more sophisticated model
for clutter. Though we assumed a uniform model, which is a standard in the
field, this model vastly underpredicts the negative effects of clutter on the probability
of false positives.

•  When  a  model  is  symmetric,  often  a  pose  which  aligns  the  model  in  the  image 
along  an  axis  of  symmetry  will  appear  very  good.  Information  about  model 
symmetry  could  be  incorporated  into  the  method  such  that  when  we  are  testing 
such  a  case,  the  threshold  would  be  raised  accordingly. 

•  Though we used simple 2D features for the analysis, there is nothing inherent
in  the  method  preventing  extensions  to  more  complex  features. 

•  The method can easily be tailored to a particular model database by representing
the score distributions for correct hypotheses for each model, and using
these distributions instead of the generic distributions to improve performance
for the given database.

It  is  our  belief  that  model  based  vision  algorithms  will  not  be  useful  unless  and  until 
we  can  know  how  much  faith  we  can  place  in  the  interpretations  given  by  them.  The 
work presented in this thesis is a step towards addressing the question.


Appendix  A 

Glossary,  Conventions  and 
Formulas 


The  notation  that  I  use  in  the  thesis  generally  follows  the  conventions  used  in  Van 
Trees  [VT68]  except  where  no  confusion  would  result  by  abbreviation. 


A.1  Conventions

Random variables are denoted by capital letters, and their values are generally denoted
by the same letter in lower case.

Vectors  (such  as  2D  image  and  model  features)  are  denoted  in  bold-face  lower  case, 
and  matrices  are  denoted  in  bold-face  upper  case. 

$P\{\cdot\}$ denotes the probability of the event in braces.

$F_X(x)$ is the probability that random variable $X$ is less than or equal to $x$.

$f_X(x)$ is the probability density function of the random variable $X$.

A vertical line in an expression means “given that”. So for example, $f_X(x \mid E)$ is the
probability density function of $X$ given event $E$. If the event being conditioned upon
is that the value of a random variable $A = a$, then we write $f_{X|A}(x \mid a)$.

$E[\cdot]$ is the expected value of the random variable in brackets.

$\mathrm{Var}(\cdot)$ is the variance of the random variable in parentheses. $\mathrm{Cov}(X, Y)$ is the
covariance between the random variables $X$ and $Y$.

$X \sim N(m, \sigma^2)$ denotes that the random variable $X$ is normally distributed with mean
$m$ and variance $\sigma^2$.
A.2  Symbols and Constants


$m + 3$        number of model features
$n + 3$        number of image (sensor) features
$m_i$          $i$th model feature
$s_i$          $i$th image feature
$\theta$       threshold used in recognition algorithm
$P_F$          probability of false positive (false detection)
$P_D$          probability of true positive (true detection)
$\sigma_0$     standard deviation of sensor noise for the Gaussian error model.
               This is considered to be a constant whose value
               must be determined empirically.
$\epsilon_0$   radius of sensor noise for the uniform error model
$A$            image area
               angles
$i, j, k$      indices
$H$            the event “three feature correspondence is correct”
$\bar{H}$      the event “three feature correspondence is incorrect”
$M$            the event “image feature arises from model”
$\bar{M}$      the event “image feature does not arise from model”, or alternatively,
               “image feature is clutter”


A.3  Random Variables


$\rho_e$ ranges over the values of the expression

$\alpha^2 + \beta^2 + (1 - \alpha - \beta)^2 + 1$

for all model points, where $(\alpha, \beta)$ is a model point's affine coordinates in the
coordinate frame established by the three model points used in the correspondence.

$\sigma_e$ describes the standard deviation of projected Gaussian error disks and is defined


$V_M(m, n, c, \sigma_0)$ describes the weight or score distribution of a single point arising from the model
for the specified values of $m$, $n$, $c$ and $\sigma_0$.

$V_{\bar{M}}(m, n, c, \sigma_0)$ describes the weight or score distribution of a single clutter point for the
specified values of $m$, $n$, $c$ and $\sigma_0$.

$W_H(m, n, c, \sigma_0)$ describes the weight or score distribution of an entire correct hypothesis for the
specified values of $m$, $n$, $c$ and $\sigma_0$.

$W_{\bar{H}}(m, n, c, \sigma_0)$ describes the weight or score distribution of an entire incorrect hypothesis for
the specified values of $m$, $n$, $c$ and $\sigma_0$.


For simplicity, the last four random variables will be referred to as $V_M$, $V_{\bar{M}}$, $W_H$ and
$W_{\bar{H}}$, since the values of the parameters $m$, $n$, $c$ and $\sigma_0$ are constant for the scope of
the discussion.


A.4  Functions of Random Variables


Let $X$ be a random variable with probability density $f_X(x)$, and let $Y$ be a random
variable which arises as a function of $X$, specifically, $Y = g(X)$. Assuming the function
$g$ is monotonically increasing and differentiable, the probability density function
for the random variable $Y$ is given as follows [BRB89]:

$f_Y(y) = \frac{f_X(g^{-1}(y))}{g'(g^{-1}(y))}$    (A.1)


For  a  monotonically  decreasing  function,  the  formula  is  the  negation  of  the  above 
expression. 
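
Formula (A.1) can be checked numerically on a simple case. The choice below, $X \sim N(0, 1)$ with $g(x) = e^x$, is an illustration and does not come from the thesis; the helper names are likewise hypothetical. The density obtained from (A.1) is integrated over an interval and compared with the same probability computed directly from the distribution of $X$.

```python
import math
from statistics import NormalDist


def transformed_density(f_x, g_inverse, g_prime):
    """Density of Y = g(X) for increasing g, per (A.1): f_X(g^-1(y)) / g'(g^-1(y))."""
    def f_y(y):
        x = g_inverse(y)
        return f_x(x) / g_prime(x)
    return f_y


def integrate(f, lo, hi, steps=20000):
    """Midpoint-rule integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(f(lo + (k + 0.5) * width) for k in range(steps))


# Illustrative choice (not from the thesis): X ~ N(0, 1) and Y = exp(X).
X = NormalDist(0.0, 1.0)
f_y = transformed_density(X.pdf, math.log, math.exp)

if __name__ == "__main__":
    a, b = 0.5, 2.0
    from_density = integrate(f_y, a, b)
    from_cdf = X.cdf(math.log(b)) - X.cdf(math.log(a))  # P(a <= Y <= b) directly
    print(from_density, from_cdf)
```

The two printed values should agree to within the accuracy of the numerical integration.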

Suppose $X$ and $Y$ are jointly distributed random variables. Then the mean and variance of $X$
can be found by conditioning on the value of $Y$ [Ros84]:

$E[X] = E[E[X \mid Y]]$    (A.2)

$\mathrm{Var}(X) = E[\mathrm{Var}(X \mid Y)] + \mathrm{Var}(E[X \mid Y])$    (A.3)
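
Identities (A.2) and (A.3) can be verified exactly on a small discrete example. The joint distribution below is hypothetical and chosen only so that the arithmetic is easy to follow; exact fractions are used so that the two sides can be compared without rounding.

```python
from fractions import Fraction as Fr

# Hypothetical joint distribution P{X = x, Y = y}; any valid table would do.
JOINT = {(0, 0): Fr(1, 8), (1, 0): Fr(3, 8), (1, 1): Fr(1, 4), (3, 1): Fr(1, 4)}


def mean_var(pmf):
    """Mean and variance of a pmf given as {value: probability}."""
    mean = sum(v * p for v, p in pmf.items())
    return mean, sum((v - mean) ** 2 * p for v, p in pmf.items())


def marginal_x(joint):
    """Marginal pmf of X from the joint table."""
    pmf = {}
    for (x, _), p in joint.items():
        pmf[x] = pmf.get(x, 0) + p
    return pmf


def by_conditioning(joint):
    """Return (E[E[X|Y]], E[Var(X|Y)] + Var(E[X|Y]))."""
    p_y = {}
    for (_, y), p in joint.items():
        p_y[y] = p_y.get(y, 0) + p
    cond = {
        y: mean_var({x: p / p_y[y] for (x, yy), p in joint.items() if yy == y})
        for y in p_y
    }
    mean = sum(p_y[y] * cond[y][0] for y in p_y)
    expected_var = sum(p_y[y] * cond[y][1] for y in p_y)
    var_of_mean = sum(p_y[y] * (cond[y][0] - mean) ** 2 for y in p_y)
    return mean, expected_var + var_of_mean


if __name__ == "__main__":
    print(mean_var(marginal_x(JOINT)))  # computed directly from the marginal of X
    print(by_conditioning(JOINT))       # computed via (A.2) and (A.3)
```

Both lines print the same mean and variance, 11/8 and 63/64 for this table.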